Abstract

Given the constant rise in quantity and quality of data obtained from neural systems
on all scales, information-theoretic analyses have become increasingly popular in the
neurosciences over the last decades. Such analyses can provide deep insights into the functioning of
such systems and also help in the characterization and analysis of neural dysfunction,
a topic that has recently come into the focus of research in the computational neurosciences.

This chapter gives a short introduction to the fundamentals of information
theory. It is suited especially, but not only, for readers with a less firm background in mathematics
and probability theory. Regarding applications, the focus will be on neuroscientific topics.

We start by reviewing fundamentals of probability theory such as the notion of proba- 
bility, probability distributions and random variables. We will then discuss the concepts 
of information and entropy (in the sense of Shannon), mutual information and transfer 
entropy (sometimes also referred to as conditional mutual information). As these quantities
cannot be computed directly from measured data in practice, we discuss estimation techniques for
information-theoretic quantities.

We conclude with a discussion of applications of information theory in the field of 
neuroscience, including questions of possible medical applications and a short review of 
software packages that can be used for information-theoretic analyses of neural data. 



1 Introduction 

Neural systems process information. This processing is of fundamental biological importance for 
all animals and humans alike as its main (if not sole) biological purpose is to ensure the survival 
of an individual (in the short run) and its species (in the long run) in a given environment by 
means of perception, cognition, action and adaptation.

Information enters a neural system in the form of sensory input representing some aspect of the
outside world, perceivable by the sensory modalities present in the system. After processing this 
information or parts of it, the system may then adjust its state and act according to a perceived 
change in the environment. 

This general model is applicable to very basic acts of cognition as well as to ones requiring
higher degrees of cognitive processing; the underlying principle is the same. Thus measuring,
modeling and (in the long run) understanding information processing in neural systems is of prime
importance for the goal of gaining insight into the functioning of neural systems on a theoretical
level.

Note that this question is of a theoretical and abstract nature, so we take an abstract
view on information in what follows. We use Shannon's theory of information [97] as a tool that
provides us with a rigorous mathematical theory and quantitative measures of information. Using
information theory, we will have a conceptual look at information in neural systems. In this
context, information theory can provide both explorative and normative views on the processing
of information in a neural system, as we will see in Section 6. In some cases, it is even possible to
gain insights into the nature of the "neural code", i.e. the way neurons transmit information via
their spiking activity.

Information theory was originally used to analyze and optimize man-made communication
systems, for which the functioning principles are known. Nonetheless, it was soon realized that
the theory could also be used in a broader setting, namely to gain insight into the functioning of
systems for which the underlying principles are far from fully understood, such as neural systems.
This was the beginning of the success story of information-theoretic methods in
many fields of science such as economics, psychology, biology, chemistry, physics and many more.

The idea of using information theory to quantitatively assess information processing in neural
systems has been around since the 1950s, see the works of Attneave [6], Barlow [9] and Eckhorn
and Pöpel [55, 33]. Yet, as information-theoretic analyses are data-intensive, these methods were
heavily restricted by (a) the limited computer memory and computational
power available and (b) the limited accuracy and amount of measured data that could be obtained
from neural systems (at the single-cell as well as at the systems level) at that time. However, given
the constant rise in available computing power and the evolution and invention of data acquisition
techniques that can be used to obtain data from neural systems (such as magnetoencephalography
(MEG), functional magnetic resonance imaging (fMRI) or calcium imaging), information-theoretic
analyses of all kinds of biological and neural systems have become more and more feasible and can
be carried out with greater accuracy and for larger and larger (sub-)systems.

Over the last decades such analyses have become possible using an average workstation computer,
a situation that could only be dreamed of in the 1970s. Additionally, the emergence of new
non-invasive data-collection methods such as fMRI and MEG that outperform more traditional
methods like electroencephalography (EEG) in terms of spatial resolution (fMRI, MEG) or noise
levels (MEG) has made it possible to obtain and analyze system-scale data of the human brain
in vivo.

The goal of this chapter is to give a short introduction to the fundamentals of information theory
and its application to data analysis problems in the neurosciences. Although information-theoretic
analyses of neural systems have so far rarely been used to gain insight into or
characterize neural dysfunction, they could prove to be a helpful tool in the future.

The chapter is organized as follows. We first discuss the process of modeling in
Section 2, which is fundamental for all that follows as it connects reality with theory. As information
theory is fundamentally based on probability theory, we then give an introduction to the
mathematical notions of probabilities, probability distributions and random variables in Section 3.
If you are familiar with probability theory, you may well skim or skip this section. Section 4
deals with the main ideas of information theory. We first look at what we mean by
information and introduce the core concept of information theory, namely entropy. Starting from
the concept of entropy, we then look at more complex notions such as conditional
entropy and mutual information in Section 4.3. We then consider a variant of conditional
mutual information called transfer entropy in Section 4.5. We conclude the theoretical part by
discussing methods used for the estimation of information-theoretic quantities from sampled data
in Section 5. What follows deals with the application of the theoretical measures to neural
data. We give a short overview of applications of the discussed theoretical methods in the
neurosciences in Section 6 and, last but not least, Section 7 contains a list of software packages
that can be used to estimate information-theoretic quantities for a given data set.

2 Modeling 

In order to analyze the dynamics and gain a theoretical understanding of a given complex 
system, one usually defines a model first, i.e. a simplified theoretical version of the system to be 
investigated. The rest of the analysis is then based on this model and can only capture aspects of 
the system that are also contained in the model. Thus, care has to be taken when creating the 
model as the following analysis crucially depends on the quality of the model. 

When building a model based on measured data, there is an important thing we have to 
pay attention to, namely that any data obtained by measurement of physical quantities is only 
accurate up to a certain degree and corrupted by noise. This naturally also holds for neural 
data (e.g. electrophysiological single- or multi-cell measurements, EEG, fMRI or MEG data). 
Therefore, when observing the state of some system by measuring it, one can only deduce the 
true state of the system up to a certain error determined by the noise in the measurement (which 
may depend both on the measurement method and the system itself). In order to model this 
uncertainty in a mathematical way, one uses probabilistic models for the states of the measured 
quantities of a system. This makes probability theory a key ingredient to many mathematical 
models in the natural sciences. 

3 Probabilities and Random Variables 

The roots of the mathematical theory of probability lie in the works of Cardano, Fermat, Pascal,
Bernoulli and de Moivre in the 16th and 17th centuries, in which the authors attempted to
analyze games of chance. Pascal and Bernoulli were the first to treat the subject as a branch of
mathematics, see [106] for a historical overview. Mathematically speaking, probability theory
is concerned with the analysis of random phenomena. Over the last centuries, it has become 
a well-established mathematical subject. For a more in-depth treatment of the subject see 



3.1 A First Approach to Probabilities via Relative Frequencies 

Let us consider an experiment that can produce a certain fixed number of outcomes (say a coin 
toss, where the possible outcomes are heads or tails or the throw of a die where the die will show 
one of the numbers 1 to 6). The set of all possible outcomes is called the sample space of the 
experiment. 

One possible result of an experiment is called an outcome and a set of outcomes is called an
event (for the mathematically adept: an event is an element of the power set of the sample space,
i.e. a subset of the set of all outcomes). Take
for example the throw of a regular, 6-sided die as an experiment. The set of results in this case
would be the set of natural numbers $\{1, \ldots, 6\}$ and examples of events are $\{1, 3, 5\}$ or $\{2, 4, 6\}$,
corresponding to the events "an odd number was thrown" and "an even number was thrown",
respectively.

The classical definition of the probability of an event is due to Laplace: "The probability of 
an event to occur is the number of cases favorable for the event divided by the number of total 
outcomes possible" [106].

We thus assign each possible outcome a probability, a real number between 0 and 1 that is
thought of as describing how "likely" it is that the given event will occur, where 0 means "the
event never occurs" and 1 means "the event always occurs". The sum of all the assigned
numbers is restricted to be 1, as we assume that one of our considered events always occurs. For
the coin toss, the possible outcomes heads and tails thus each have probability $\frac{1}{2}$ (considering
that the number of favorable outcomes is 1 and the number of possible outcomes is 2) and for
the throw of a die this number is $\frac{1}{6}$ for each digit. This is assuming that we have a so-called fair
coin or die, i.e. one that does not favor any particular outcome over the others.

The probability of a given event to occur is then just the sum of the probabilities of the
outcomes the event is composed of, e.g. when considering the throw of a die, the probability of
the event "an odd number is thrown" is $\frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$.

Experiments in which all possible outcomes have the same probability (they are
called equiprobable) are called Laplacian experiments. The simplest case of an experiment not
having equiprobable outcomes is the so-called Bernoulli experiment. Here, two possible outcomes
"success" and "failure", with probabilities $p \in [0, 1]$ and $1 - p$, are considered. Let us now consider
probabilities in the general setting.
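
To make Laplace's rule concrete, the following minimal Python sketch computes the probability of an event in an experiment with equiprobable outcomes as the ratio of favorable to possible outcomes. The function name `laplace_probability` is our own choice for illustration and does not come from any library.

```python
from fractions import Fraction


def laplace_probability(event, sample_space):
    """Laplace's rule: favorable outcomes divided by all possible outcomes.

    Only valid for experiments with equiprobable outcomes.
    """
    event = set(event)
    sample_space = set(sample_space)
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


die = range(1, 7)
p_odd = laplace_probability({1, 3, 5}, die)                    # "an odd number is thrown"
p_heads = laplace_probability({"heads"}, {"heads", "tails"})   # fair coin
print(p_odd, p_heads)  # 1/2 1/2
```

Exact fractions are used instead of floating point numbers so that the results can be compared directly with the values derived in the text.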

3.2 An Axiomatic Description of Probabilities 

The foundations of modern probability theory were laid by Kolmogorov [54] in the 1930s. He was
the first to give an axiomatic description of probability theory based on measure theory, putting
the field on a mathematically sound basis. We will state his axiomatic description of probabilities
in the following. This rather technical approach might seem a little complicated and cumbersome
at first, so we will try to give accessible explanations of the concepts and notions used, as
they are of general importance.

Kolmogorov's definition is based on what is known as measure theory, a field of mathematics
that is concerned with measuring the (geometric) size of subsets of a given space. Measure theory
gives an axiomatic description of a measure (as a function $\mu$ assigning a non-negative number to
each subset) that fulfills the usual properties of a geometric measure of length (in 1-dimensional
space), area (in 2-dimensional space), volume (in 3-dimensional space), and so on. For example,
if we take the measure of two disjoint (i.e. non-overlapping) sets, we expect the measure of their
union to be the sum of the measures of the two sets.

One prior remark on the definition: When looking at sample spaces (remember, these are the
sets of possible outcomes of a random experiment) we have to make a fundamental distinction
between discrete sample spaces (i.e. ones in which the outcomes can be separated and counted,
like in a pile of sand, where we think of each little sand particle representing one possible outcome)
and continuous sample spaces (where the outcomes form a continuum and cannot be separated
and counted; think of this sample space as some kind of dough in which the outcomes cannot be
separated). Although in most cases the continuous setting can be treated as a straightforward
generalization of the discrete case and we just have to replace sums by integrals in the formulas,
some technical subtleties exist that make a distinction between the two cases necessary. This is
why we separate the two cases in all of what follows.

Definition 3.1 (measure space and probability space) A measure space is a triple $(\Omega, \mathcal{F}, \mu)$.
Here

• the base space $\Omega$ denotes an arbitrary nonempty set,

• $\mathcal{F}$ denotes the set of measurable sets in $\Omega$, which has to be a so-called $\sigma$-algebra over $\Omega$, i.e.
it has to fulfill

(i) $\emptyset \in \mathcal{F}$,

(ii) $\mathcal{F}$ is closed under complements: if $E \in \mathcal{F}$, then $(\Omega \setminus E) \in \mathcal{F}$,

(iii) $\mathcal{F}$ is closed under countable unions: if $E_i \in \mathcal{F}$ for $i = 1, 2, \ldots$, then $(\bigcup_i E_i) \in \mathcal{F}$,

• $\mu$ is the so-called measure. It is a function $\mu : \mathcal{F} \to \mathbb{R} \cup \{\infty\}$ with the following properties

(i) $\mu(\emptyset) = 0$ and $\mu \geq 0$ (non-negativity),

(ii) $\mu$ is countably additive: if $E_i \in \mathcal{F}$, $i = 1, 2, \ldots$ is a collection of pairwise disjoint (i.e.
non-overlapping) sets, then $\mu(\bigcup_i E_i) = \sum_i \mu(E_i)$.

Why this complicated definition of measurable sets, measures, etc.? Well, this is
probably (no pun intended) the simplest way to mathematically formalize the notion of a "measure" (in terms
of geometric volume) as we know it over the real numbers.

When defining a measure, we first have to fix the whole space in which we want to measure.
This is the base space $\Omega$. $\Omega$ can be any arbitrary set: The sample space of a random experiment,
e.g. $\Omega = \{\text{heads}, \text{tails}\}$ when we look at a coin toss or $\Omega = \{1, \ldots, 6\}$ when we look at the throw
of a die (these are two examples of discrete sets), the set of real numbers $\mathbb{R}$, the real plane $\mathbb{R}^2$
(these are two examples of continuous sets) or whatever you choose it to be. When modeling the
spiking activity of a neuron, the two states could be "neuron spiked" or "neuron didn't spike".

In a second step we choose a collection of subsets of $\Omega$ that we name $\mathcal{F}$, the collection of
subsets of $\Omega$ that we want to be measurable. Note that the measurable subsets of $\Omega$ are not given
a priori, but that we determine them by choosing $\mathcal{F}$. So, you may ask, why this complicated
setup with $\mathcal{F}$, why not make every possible subset of $\Omega$ measurable, i.e. make $\mathcal{F}$ the power set
of $\Omega$ (the power set is the set of all subsets of $\Omega$)? This is totally reasonable and can easily
be done when the number of elements of $\Omega$ is finite. But as with many things in mathematics,
things get complicated when we deal with the continuum: In many natural settings, e.g. when
$\Omega$ is a continuous set, this is just not possible or desirable for technical reasons. That is why
we choose only a subset of the power set (you might refer to its elements as the "privileged"
subsets) and make only the contained subsets measurable. We want to choose this subset in a way
that the usual constructions that we know from geometric measures still work in the usual way,
though. This motivates the properties that we impose on $\mathcal{F}$: We expect
the complements of measurable sets, as well as the union and intersection of a finite number of
measurable sets, to again be measurable. These properties are motivated by the corresponding
properties of geometric measures (i.e. the union, intersection and complement of intervals of
certain lengths has a length and so on). So to sum up, the set $\mathcal{F}$ is a subset of the power set of $\Omega$,
and sets that are not in $\mathcal{F}$ are not measurable.

In a last step, we choose a function $\mu$ that assigns a measure (think of it as a generalized
geometric volume) to each measurable set (i.e. each element of $\mathcal{F}$), where the measure has to
fulfill some basic properties that we know from geometric measures: The measure is non-negative,
the empty set (that is contained in every set) should have measure 0 and the measure is additive.

All together, this makes the triple $(\Omega, \mathcal{F}, \mu)$ a space in which we can measure events and use
constructions that we know from basic geometry. Our definition makes sure that the measure $\mu$
behaves in the way we expect it to (mathematicians call this a natural construction). Take some
time to think about it: Definition 3.1 above generalizes the notion of the geometric measure in
terms of the length $l(I) = b - a$ of intervals $I = [a, b]$ over the real numbers.

In fact, when choosing the set $\Omega = \mathbb{R}$ we can construct the so-called Borel $\sigma$-algebra $\mathcal{B}$
that contains all closed intervals $I = [a, b]$, $a < b$, and a measure $\mu_{\mathcal{B}}$ that assigns each interval
$I = [a, b] \in \mathcal{B}$ its length $\mu_{\mathcal{B}}(I) = b - a$. The measure $\mu_{\mathcal{B}}$ is called Borel measure. It is the standard
measure of length that we know from geometry and makes $(\mathbb{R}, \mathcal{B}, \mu_{\mathcal{B}})$ a measure space. This
construction can easily be extended to arbitrary dimensions (using closed sets) resulting in the
measure space $(\mathbb{R}^n, \mathcal{B}^n, \mu_{\mathcal{B}^n})$ that fulfills the properties of an $n$-dimensional geometric measure of
volume.

Let us look at some examples of measure spaces now: 

1. Let $\Omega = \{0, 1\}$, $\mathcal{F} = \{\emptyset, \{0\}, \{1\}, \Omega\}$ and $P$ with $P(0) = P(1) = 0.5$. This makes $(\Omega, \mathcal{F}, P)$ a
measure space for our coin toss experiment. Note that in this simple case, $\mathcal{F}$ equals the full
power set of $\Omega$.

2. Let $\Omega = \{a, b, c, d\}$ and let $\mathcal{F} = \{\emptyset, \{a, b\}, \{c, d\}, \Omega\}$ with $P(\{a, b\}) = p$ and $P(\{c, d\}) = 1 - p$,
where $p$ denotes an arbitrary number between 0 and 1. This makes $(\Omega, \mathcal{F}, P)$ a measure
space.
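
For finite base spaces the axioms of a σ-algebra can be checked mechanically. The following Python sketch does this for the second example above and for a family of sets that fails the axioms. The helper `is_sigma_algebra` is a hypothetical name introduced here for illustration; it relies on the fact that for a finite family, closure under pairwise unions already implies closure under countable unions.

```python
from itertools import combinations


def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a family of subsets of a finite set."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # every member has to be a subset of the base space
    if any(not s <= omega for s in family):
        return False
    # (i) the empty set is measurable
    if frozenset() not in family:
        return False
    # (ii) closed under complements
    if any(omega - s not in family for s in family):
        return False
    # (iii) closed under unions (pairwise closure suffices for a finite family)
    return all(a | b in family for a, b in combinations(family, 2))


omega = {"a", "b", "c", "d"}
family = [set(), {"a", "b"}, {"c", "d"}, omega]
print(is_sigma_algebra(omega, family))                # True
print(is_sigma_algebra(omega, [set(), {"a"}, omega])) # False: complement of {a} is missing
```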

Having understood the general case of a measure space, defining a probability space and a 
probability distribution is easy. 

Definition 3.2 (probability space, probability distribution) A probability space is a measure
space $(\Omega, \mathcal{F}, \mu)$ for which the measure $\mu$ is normed, i.e. $\mu : \mathcal{F} \to [0, 1]$ with $\mu(\Omega) = 1$. The
measure $\mu$ is called probability distribution and is often also denoted by $P$ (for probability). $\Omega$ is
called the sample space, elements of $\Omega$ are called outcomes and $\mathcal{F}$ is the set of events.

Note that again, we make the distinction between discrete and continuous sample spaces here. 
In the course of history, a probability distribution on a discrete sample space came to be called 
probability mass function (or pmf) and a probability distribution defined on a continuous sample 
space came to be called probability density function (or pdf). 

Let us look at a few examples, where the probability spaces in the following are given by the
triple $(\Omega, \mathcal{F}, P)$.

1. Let $\Omega = \{\text{heads}, \text{tails}\}$ and let $\mathcal{F} = \{\emptyset, \{\text{heads}\}, \{\text{tails}\}, \Omega\}$. This is a probability space for
our coin toss experiment, where $\emptyset$ relates to the event "neither heads nor tails" and $\Omega$ to
the event "either heads or tails". Note that in this simple case, $\mathcal{F}$ equals the full power set
of $\Omega$.

2. Let $\Omega = \{1, \ldots, 6\}$ and let $\mathcal{F}$ be the full power set of $\Omega$ (i.e. the set of all subsets of $\Omega$; there
are $2^6 = 64$ of them, can you enumerate them all?). This is a probability space for our experiment of dice
throws, where we can distinguish all possible events.

Figure 1: Theoretical quantities and measurable quantities: The only things usually observable and
accessible are data (measured or generated); theoretical quantities are not directly
accessible. They have to be estimated using statistical methods.



3.3 Theory and Reality 

It is important to stress that probabilities themselves are a mathematical and purely theoretical
construct to help in understanding and analyzing random experiments, and per se they do not
have anything to do with reality. They can be understood as an "underlying law" that generates
the outcomes of a random experiment and can never be directly observed, see Figure 1. But
with some restrictions they can be estimated for a certain given experiment by looking at the
outcomes of many repetitions of that experiment.

Let us consider the following example. Assume that our experiment is the roll of a six-sided
die. When repeating the experiment 10 times (each repetition is also called a trial) we will obtain frequencies
for each of the numbers as given in Figure 2(a). Repeating the experiment 100 times we
will get frequencies that look similar to the ones given in Figure 2(b). If we look at the relative
frequencies (i.e. the frequency divided by the total number of trials), we see that these converge
to the theoretically predicted value of $\frac{1}{6}$ as our number of trials grows larger.

This fundamental finding is known as "Borel's law of large numbers".

Theorem 3.3 (Borel's law of large numbers) Let $\Omega$ be a sample space of some experiment
and let $P$ be a probability mass function on $\Omega$. Furthermore let $N_n(E)$ be the number of occurrences
of the event $E \subset \Omega$ when the experiment is repeated $n$ times. Then the following holds:

$$\frac{N_n(E)}{n} \to P(E) \quad \text{as } n \to \infty.$$



Borel's law of large numbers states that if an experiment is repeated many times (where the
trials have to be independent and done under identical conditions), then the relative frequencies of
the outcomes converge to their probabilities as assigned by the probability mass function. The
theorem thus establishes the notion of probability as the long-run relative frequency of an event's
occurrence and thereby connects the theoretical side to the experimental side. Keep in mind
though that we can never directly measure probabilities and although relative frequencies will
converge to the probability values, they will usually not be exactly equal.

Figure 2: Relative frequencies of tossed digits using a fair die: (a) after 10 tosses and (b) after
1000 tosses.
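
The convergence of relative frequencies can be illustrated with a small simulation. The following Python sketch rolls a simulated fair die and reports the largest deviation of the relative frequencies from $\frac{1}{6}$ for an increasing number of trials. The function name, the fixed seed and the chosen trial counts are our own assumptions for illustration; the pseudo-random number generator only approximates independent, identical trials.

```python
import random
from collections import Counter


def relative_frequencies(n_trials, seed=0):
    """Roll a simulated fair die n_trials times and return relative frequencies."""
    rng = random.Random(seed)  # fixed seed, so that repeated runs give the same result
    counts = Counter(rng.randint(1, 6) for _ in range(n_trials))
    return {face: counts[face] / n_trials for face in range(1, 7)}


for n in (10, 1000, 100000):
    freqs = relative_frequencies(n)
    max_dev = max(abs(f - 1 / 6) for f in freqs.values())
    print(f"n = {n:6d}: largest deviation from 1/6 is {max_dev:.4f}")
```

For a small number of trials the deviations are typically large, while they shrink as the number of trials grows. Note that, in line with the remark above, they do not become exactly zero.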

3.4 Independence of Events and Conditional Probabilities 

A fundamental notion in probability theory is the idea of independence of events. Intuitively, we
call two events independent if the occurrence of one does not affect the probability of occurrence of
the other. Consider for example the event that it rains and the event that the current day of
the week is Monday. These two are clearly independent, unless we lived in a world where there
was a correlation between the two, i.e. where the probability of rain on
Mondays differed from that on the other days of the week, which is clearly not the case.

Similarly, we establish the notion of independence of two events in the sense of probability 
theory as follows. 

Definition 3.4 (independent events) Let $A$ and $B$ be two events of some probability space
$(\Omega, \Sigma, P)$. Then $A$ and $B$ are called independent if and only if

$$P(A \cap B) = P(A)P(B). \tag{3.1}$$

The term $P(A \cap B)$ is referred to as the joint probability of $A$ and $B$, see Figure 3.
Another important concept is the notion of conditional probability, i.e. the probability of one 
event A occurring, given the fact that another event B occurred. 

Figure 3: Two events $A$ and $B$, their union $A \cup B$, their intersection $A \cap B$ (i.e. common occurrence
in terms of probability) and their exclusive occurrences $A \cap B^c$ ($A$ occurs and not $B$), $B \cap A^c$
($B$ occurs and not $A$), where $\cdot^c$ denotes the complement in $A \cup B$.

Definition 3.5 (conditional probability) Given two events $A$ and $B$ of some probability space
$(\Omega, \mathcal{F}, P)$ with $P(B) > 0$ we call

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

the conditional probability of $A$ given $B$.

Note that for independent events $A$ and $B$, we have $P(A \cap B) = P(A)P(B)$ and thus
$P(A|B) = P(A)$ and $P(B|A) = P(B)$. We can thus write

$$P(A \cap B) = P(A|B)P(B) = P(A)P(B),$$

and this means that the occurrence of $A$ does not affect the conditional probability of $B$ given
$A$ (and vice versa). This exactly reflects the intuitive definition of independence that we gave in
the first paragraph of this section. Note that we could have also used the conditional probabilities
to define independence in the first place. Nonetheless, the definition of Equation 3.1 is preferred,
as it is shorter, symmetrical in $A$ and $B$, and because the conditional probabilities above
are not defined in the case where $P(A) = 0$ or $P(B) = 0$.
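
As a small worked example, the following Python sketch evaluates Equation 3.1 and the conditional probability for events of a fair die throw. The sample space and the helper names `prob`, `cond_prob` and `independent` are our own choices for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # sample space of a fair six-sided die


def prob(event):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(frozenset(event) & OMEGA), len(OMEGA))


def cond_prob(a, b):
    """Conditional probability P(A|B) = P(A and B) / P(B), defined for P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(set(a) & set(b)) / prob(b)


def independent(a, b):
    """Check Equation 3.1: P(A and B) == P(A) * P(B)."""
    return prob(set(a) & set(b)) == prob(a) * prob(b)


even = {2, 4, 6}
at_most_two = {1, 2}
at_most_three = {1, 2, 3}
print(independent(even, at_most_two))    # True:  1/6 == 1/2 * 1/3
print(independent(even, at_most_three))  # False: 1/6 != 1/2 * 1/2
print(cond_prob(even, at_most_three))    # 1/3, which differs from P(even) = 1/2
```

The check in `independent` uses the product form rather than conditional probabilities, so it also works when one of the events has probability 0.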

3.5 Random Variables 

In many cases the sample spaces of random experiments are a lot more complicated than the 
ones of the toy examples we looked at so far. Think for example of measurements of membrane 
potentials of certain neurons, that we want to model mathematically, or the state of some 
complicated system, e.g. a network of neurons receiving some stimulus. 
Thus mathematicians came up with a way to tame the sample spaces by looking at the events
indirectly, namely by first mapping the events to some better understood space, like the set of
real numbers (or some higher dimensional real vector space), and then looking at outcomes of the
random experiment in the simplified space rather than in the complicated original space. Looking
at spaces of numbers has many advantages: order relations exist (smaller, equal, larger), we can 
form averages and much more. This leads to the concept of random variables. 

A (real) random variable is a function that maps each outcome of a random experiment to 
some (real) number. Thus, a random variable can be thought of as a variable whose value is 
subject to variations due to chance. But keep in mind that a random variable is a mapping and 
not a variable in the usual sense. 

Mathematically, a random variable is defined using what is called a measurable function. 
A measurable function is nothing more than a map from one measurable space to another for 
which the pre-image of each measurable set is again measurable (with respect to the two different 
measures in the two measure spaces involved). So a measurable map is nothing more than a
"nice" map respecting the structures of the spaces involved (take as an example for such maps the
continuous functions over $\mathbb{R}$).

Definition 3.6 (random variable) Let $(\Omega, \Sigma, P)$ be a probability space and $(\Omega', \Sigma')$ a measurable
space. A $(\Sigma, \Sigma')$-measurable function $X : \Omega \to \Omega'$ is called $\Omega'$-valued random variable (or just
$\Omega'$-random variable) on $\Omega$.

Commonly, a distinction between continuous random variables and discrete random variables
is made, the former taking values on some continuum (in most cases $\mathbb{R}$) and the latter on a
discrete set (in most cases $\mathbb{Z}$).

A type of random variable that plays an important role in modeling is the so-called
Bernoulli random variable that only takes two distinct values, 0 with probability $p$ and 1 with
probability $1 - p$ (i.e. it has a Bernoulli distribution as its underlying probability distribution).
Spiking behavior of a neuron is often modeled that way, where 1 stands for "neuron spiked" and
0 for "neuron didn't spike" (in some interval of time).

A real- or integer-valued random variable $X$ thus assigns a number $X(E)$ to every event
$E \in \Sigma$. A value $X(E)$ corresponds to the occurrence of the event $E$ and is called a realization
of $X$. Thus, random variables allow for the change of space in which outcomes of probabilistic
processes are considered. Instead of considering an outcome directly in some complicated space,
we first project it to a simpler space using our mapping (the random variable $X$) and interpret
its outcome in that simpler space.

In terms of measure theory, a random variable $X : (\Omega, \Sigma, P) \to (\Omega', \Sigma')$ (again, considered as
a measurable mapping here) induces a probability measure $P_X$ on the measure space $(\Omega', \Sigma')$ via

$$P_X(S') := P(X^{-1}(S')),$$

where again $X^{-1}(S')$ denotes the pre-image of $S' \in \Sigma'$. This also justifies the restriction of $X$
to be measurable: If it were not, such a construction would not be possible, but this is a technical
detail. As a result, this makes $(\Omega', \Sigma', P_X)$ a probability space and we can think of the measure
$P_X$ as the "projection" of the measure $P$ from $\Omega$ onto $\Omega'$ (via the measurable mapping $X$).

The measures $P$ and $P_X$ are probability densities for the probability distributions over $\Omega$ and
$\Omega'$: They measure the likelihood of occurrence for each event ($P$) or value ($P_X$).

As a simple example of a random variable consider again the example of the coin toss. Here,
we have $\Omega = \{\text{heads}, \text{tails}\}$, $\mathcal{F} = \{\emptyset, \{\text{heads}\}, \{\text{tails}\}, \Omega\}$ and $P$ that assigns to both heads and
tails the probability $\frac{1}{2}$, forming the probability space. Consider as a random variable $X : \Omega \to \Omega'$
with $\Omega' = \{0, 1\}$ that maps $\Omega$ to $\Omega'$ such that $X(\text{heads}) = 0$ and $X(\text{tails}) = 1$. If we choose
$\mathcal{F}' = \{\emptyset, \{0\}, \{1\}, \{0, 1\}\}$ as a $\sigma$-algebra for $\Omega'$ this makes $M = (\Omega', \mathcal{F}')$ a measurable space and
$X$ induces a measure $P' = P_X$ on $M$ with $P'(\{0\}) = P'(\{1\}) = \frac{1}{2}$. That makes $(\Omega', \mathcal{F}', P')$ a
measure space and since $P'$ is normed it is a probability space.
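
For a finite sample space, the induced measure $P_X$ can be computed by summing the probabilities of all outcomes that $X$ maps to the same value. The following Python sketch does this for the coin toss example and for the parity of a die throw. The function name `pushforward` and the dictionary representation of $P$ are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(p, x):
    """Induced distribution P_X(v) = P(X^{-1}({v})) on a finite sample space.

    p maps each outcome to its probability, x is the random variable (a function).
    """
    p_x = defaultdict(Fraction)  # missing values start at probability 0
    for outcome, mass in p.items():
        p_x[x(outcome)] += mass
    return dict(p_x)


# coin toss: X(heads) = 0, X(tails) = 1
coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
coin_x = pushforward(coin, {"heads": 0, "tails": 1}.get)
print(coin_x[0], coin_x[1])  # 1/2 1/2

# die throw: X maps each digit to its parity (1 for odd, 0 for even)
die = {i: Fraction(1, 6) for i in range(1, 7)}
parity = pushforward(die, lambda i: i % 2)
print(parity[0], parity[1])  # 1/2 1/2
```

In both cases the induced probabilities sum to 1, reflecting that $P_X$ is again a normed measure.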

Cumulative Distribution Function 

For random variables that take values in the integers or the real numbers, the natural total
ordering of elements in these spaces enables us to define the so-called cumulative distribution
function (or cdf) for a random variable.

Definition 3.7 (cumulative distribution function) Let $X$ be an $\mathbb{R}$-valued or $\mathbb{Z}$-valued random
variable on some probability space $(\Omega, \Sigma, P)$. Then the function

$$F(x) := P(X \le x)$$

is called the cumulative distribution function of $X$.

The expression $P(X \le x)$ evaluates to

$$P(X \le x) = \int_{\tau \le x} P(X = \tau)\, d\tau$$

in the continuous case and to

$$P(X \le x) = \sum_{k \le x} P(X = k)$$

in the discrete case.

In that sense, the measure $P_X$ can be understood as the derivative of the cumulative distribution
function $F$:

$$P(x_1 < X \le x_2) = F(x_2) - F(x_1),$$

and we also write $F(x) = \int_{-\infty}^{x} P_X(\tau)\, d\tau$ in the continuous case.
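
In the discrete case the cumulative distribution function is just a running sum over the probability mass function. The following Python sketch evaluates it for a fair die. It assumes the common convention $F(x) = P(X \le x)$; the function name `cdf` and the dictionary representation of the pmf are our own choices.

```python
from fractions import Fraction


def cdf(pmf, x):
    """F(x) = P(X <= x) for an integer-valued random variable given by its pmf."""
    return sum((mass for k, mass in pmf.items() if k <= x), Fraction(0))


die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}
print(cdf(die_pmf, 3))                    # 1/2
print(cdf(die_pmf, 5) - cdf(die_pmf, 2))  # P(2 < X <= 5) = 1/2
print(cdf(die_pmf, 0), cdf(die_pmf, 6))   # 0 1
```

The second line illustrates how probabilities of intervals are obtained as differences of values of $F$.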

Independence of Random Variables 

The definition of independent events directly transfers to random variables: Two random variables 
X, Y are called independent if the conditional probability distribution of X (Y) given an observed 
value of Y (X) does not differ from the probability distribution of X (Y) alone. 

Definition 3.8 (independent random variables) Let $X$, $Y$ be two random variables. Then
$X$ and $Y$ are called independent, if the following holds for any observed values $x$ of $X$ and $y$ of
$Y$:

$$P(X | Y = y) = P(X) \quad \text{and} \quad P(Y | X = x) = P(Y).$$

This notion can be generalized naturally to the case of three or more random variables.
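
For discrete random variables with a known joint probability mass function, independence can be checked numerically. The following Python sketch tests whether the joint distribution equals the product of its marginals, which is equivalent to the conditional formulation above whenever the conditioning values have positive probability. The example of two fair coin tosses and all function names are our own assumptions for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf given as {(x, y): probability}."""
    px, py = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), mass in joint.items():
        px[x] += mass
        py[y] += mass
    return dict(px), dict(py)


def are_independent(joint):
    """True if P(X = x, Y = y) == P(X = x) * P(Y = y) for all pairs (x, y)."""
    px, py = marginals(joint)
    return all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
               for x, y in product(px, py))


tosses = list(product((0, 1), repeat=2))  # outcomes of two fair coin tosses

# X = first toss, Y = second toss
first_second = {(a, b): Fraction(1, 4) for a, b in tosses}

# X = first toss, S = sum of both tosses
first_sum = defaultdict(Fraction)
for a, b in tosses:
    first_sum[(a, a + b)] += Fraction(1, 4)

print(are_independent(first_second))     # True
print(are_independent(dict(first_sum)))  # False: the sum depends on the first toss
```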

Expectation and Variance

Two very important concepts of random variables are the so-called expectation value (or just
expectation) and the variance. The expectation of a random variable $X$ is the mean value of
the random variable, where the weighting of the values corresponds to the probability
distribution. It thus tells us what value of $X$ we should expect "on average":

Definition 3.9 (expectation value) Let X be a R- or "L-valued random variable. Then its 
expectation value (sometimes also denoted by fj.) is given by 

E[X]:= fxPx(x)dx= fxdPx, 

for a real-valued random variable X and by 

E[X]:=Y.xPx{x) 

xeZ 

if X is Z-valued. 

Note that whenever it could be unclear with respect to which probability distribution the expectation value
is taken, we will include that probability distribution in
the index. Consider for example two random variables X and Y defined on the same base space
but with different underlying probability distributions. In this case, we denote by $E_X[Y]$ the
expectation value of Y taken with respect to the probability distribution of X.

Let us now look at an example. If we consider the throw of a fair die with $P(i) = \frac{1}{6}$ for each
digit $i = 1, \ldots, 6$ and take X as the random variable that just assigns each digit its integer value
$X(i) = i$, we get $E[X] = \frac{1}{6}(1 + \cdots + 6) = 3.5$.

Another important concept is the so-called variance of a random variable. The variance is a 
measure for how far the values of the random variable are spread around its expected value. It is 
defined as follows. 

Definition 3.10 (variance) Let X be an $\mathbb{R}$- or $\mathbb{Z}$-valued random variable. Then its variance is
given as

$$\mathrm{var}[X] := E[(E[X] - X)^2] = E[X^2] - (E[X])^2,$$

sometimes also denoted as $\sigma^2$.

The variance is thus the expected squared distance of the values of the random variable to its
expected value.

Another commonly used measure is the so-called standard deviation $\sigma(X) = \sqrt{\mathrm{var}(X)}$, a
measure for the typical deviation of realizations of X from the mean value.

Often one also refers to the expectation value as the "first-order moment" of the random
variable and to the variance as a "second-order (central) moment". Higher-order moments can be constructed
analogously, but will not be of interest to us in the following.

Note again that the concepts of expectation and variance live on the theoretical side of the
world, i.e. we cannot measure these quantities directly. The only thing that we can do is try to
estimate them from a set of measurements (i.e. realizations of the involved random variables),
see Figure 1. The statistical discipline of estimation theory deals with questions regarding the
estimation of theoretical quantities from real data. We will talk about estimation in more detail
in Section 5 and just give two examples here.

For estimating the expected value we can use what is called the sample mean.

Definition 3.11 (sample mean) Let X be an $\mathbb{R}$- or $\mathbb{Z}$-valued random variable with n realizations
$x_1, \ldots, x_n$. Then the sample mean $\hat{\mu}$ of the realizations is given as

$$\hat{\mu}(x_1, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i.$$

As we will see below, this sample mean provides a good estimate of the expected value if
the number n of samples is large enough. Similarly, we can estimate the variance as follows.

Definition 3.12 (sample variance) Let X be an $\mathbb{R}$- or $\mathbb{Z}$-valued random variable with n realizations
$x_1, \ldots, x_n$. Then the sample variance $\hat{\sigma}^2$ of the realizations is given as

$$\hat{\sigma}^2(x_1, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{\mu}(x_1, \ldots, x_n) \right)^2,$$

where $\hat{\mu}$ denotes the sample mean.

Before going on let us calculate some examples of expectations and variances of random
variables. Take the coin toss example from above. Here, the expected value of X is $E[X] =
\frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 1 = \frac{1}{2}$ and the variance is $\mathrm{var}(X) = E[(E[X] - X)^2] = \frac{1}{2} \cdot (0 - \frac{1}{2})^2 + \frac{1}{2} \cdot (1 - \frac{1}{2})^2 = \frac{1}{4}$. For the
example of the dice roll (where the random variable X takes the value of the number thrown) we
get $E[X] = \frac{1+2+3+4+5+6}{6} = \frac{7}{2} = 3.5$ and $\mathrm{var}(X) = E[X^2] - (E[X])^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92$.
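
The two worked examples can be checked with a few lines of Python. The sketch below represents a discrete distribution as a dictionary from values to probabilities (a representation chosen here purely for illustration; the function names `expectation` and `variance` are ours) and uses exact fractions so that the results can be compared with the values computed by hand.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over x of x * P(X = x) for a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """var[X] = E[(E[X] - X)^2]."""
    mu = expectation(pmf)
    return sum((mu - x) ** 2 * p for x, p in pmf.items())

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}    # fair coin with values 0 and 1
die = {k: Fraction(1, 6) for k in range(1, 7)}   # fair die

print(expectation(coin), variance(coin))
print(expectation(die), variance(die))
```

For the coin this yields 1/2 and 1/4, for the die 7/2 and 35/12, in agreement with the calculation above.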

3.6 Laws of Large Numbers 

The laws of large numbers (there exist two versions as we will see below) state that the sample
average of a set of realizations of a random variable "almost certainly" converges to the random
variable's expected value when the number of realizations grows to infinity.

Theorem 3.13 (law of large numbers) Let $X_1, X_2, \ldots$ be an infinite sequence of independent,
identically distributed random variables with expected values $E(X_1) = E(X_2) = \cdots = \mu$. Let
$\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$ be the sample average.

(i) Weak law of large numbers. The sample average converges in probability towards the
expected value, i.e. for any $\varepsilon > 0$

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0.$$

This is sometimes also expressed as

$$\bar{X}_n \xrightarrow{P} \mu \quad \text{when } n \to \infty.$$

(ii) Strong law of large numbers. The sample average converges almost surely towards the
expected value, i.e.

$$P\left( \lim_{n \to \infty} \bar{X}_n = \mu \right) = 1.$$

This is sometimes also expressed as

$$\bar{X}_n \xrightarrow{a.s.} \mu \quad \text{when } n \to \infty.$$

The weak version of the law states that the sample average $\bar{X}_n$ is likely to be close to $\mu$ for
any sufficiently large value of n. But this does not exclude the possibility of $|\bar{X}_n - \mu| > \varepsilon$ occurring an
infinite number of times.

The strong law says that this "almost surely" will not be the case: With probability 1, for every $\varepsilon > 0$ the
inequality $|\bar{X}_n - \mu| < \varepsilon$ holds for all large enough n.
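
The statement of the law of large numbers can be made tangible by simulation. The following sketch, which assumes Python's standard pseudo-random generator as a stand-in for a fair die (the function name `sample_mean_of_die` and the fixed seed are our own choices), computes the sample mean of n simulated throws.

```python
import random

def sample_mean_of_die(n, seed=0):
    """Sample mean of n simulated throws of a fair die (seeded for reproducibility)."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n

# The sample mean for increasing numbers of throws.
for n in (10, 1000, 100000):
    print(n, sample_mean_of_die(n))
```

With growing n the printed values typically approach the expected value 3.5. A single simulated run is of course only an illustration of the theorem and not a proof.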

3.7 Some Parametrized Probability Distributions 

Certain probability distributions often occur naturally when looking at typical random experiments. 
In the course of history, these were thus put (mathematicians like doing such things) into families 
or classes, and the members of one class are distinguished by a set of parameters (a parameter is
just a number that can be chosen freely in some specified range). To specify a certain probability
distribution we simply have to specify in which class it lies and which parameter values it
has, which is more convenient than specifying the probability distribution explicitly every
time. This also allows proving (and reusing) results for whole classes of probability distributions
and facilitates communication with other scientists.

Note that we will only give a concise version of the most important distributions relevant in 
neuroscientific applications here and point the reader to [471 1981 152] for a more in-depth treatment 
of the subject. 

The normal distribution $N(\mu, \sigma^2)$ is a family of continuous probability distributions parametrized
by two real-valued parameters $\mu \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}^+$, called mean and variance. Its probability
density function is given as

$$f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}.$$

The family is closed under linear combinations, i.e. linear combinations of independent normally distributed
random variables are again normally distributed. It is the most important and most often used
probability distribution in probability theory and statistics, as the sample averages of many other probability distributions
can be approximated by a normal distribution when the sample size is large enough (this fact
is called the central limit theorem). See Figure 4 for examples of the pdf and cdf for normally
distributed random variables.

The Bernoulli probability distribution Ber(p) describes the two possible outcomes of a Bernoulli 
experiment with the probability of success and failure being p and 1 - p, respectively. It is thus 
a discrete probability distribution on two elements and it is parametrized by one parameter
$p \in [0, 1] \subset \mathbb{R}$. Its probability mass function is given by the two values P(success) = p and
P(failure) = 1 - p.

The binomial probability distribution B(n, p) is a discrete probability distribution parametrized
by two parameters $n \in \mathbb{N}$ and $p \in [0, 1] \subset \mathbb{R}$. Its probability mass function is

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad (3.2)$$

and it can be thought of as a model for the probability of k successful outcomes in a series
of n independent Bernoulli experiments, each having success probability p.

The Poisson distribution Poiss($\lambda$) is a family of discrete probability distributions parametrized
by one real parameter $\lambda \in \mathbb{R}^+$. Its probability mass function is given by

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$

Figure 4: Normal distribution: probability density function (a) and cumulative distribution function
(b) for selected parameter values of $\mu$ and $\sigma^2$. Solid line: $\mu = 0$, $\sigma^2 = 1$, dashed line: $\mu = 1$, $\sigma^2 = 0.2$,
dotted line: $\mu = -1$, $\sigma^2 = 0.5$.

The Poisson distribution plays an important role in the modeling of neuroscience data. This
is the case because the firing statistics of cortical neurons (and also other kinds of neurons) can
often be well fit by a Poisson process, where $\lambda$ is considered the mean firing rate of a given neuron,

see una [Ml Eg. 

This fact comes as no surprise if we invest some thought. The Poisson distribution can be
seen as a limiting case of the binomial distribution. A theorem known as the Poisson limit theorem
(sometimes also called the "law of rare events") tells us that in the limit $p \to 0$ and $n \to \infty$ the
binomial distribution converges to the Poisson distribution with $\lambda = np$. Consider for example
the spiking activity of a neuron that we could model via a binomial distribution. We discretize
time and consider time bins of say 2 ms and assume a mean firing rate of the neuron denoted by
$\lambda$ (measured in Hertz). Clearly, in most time bins the neuron does not spike (corresponding to a
small value of p) and the number of bins is large (corresponding to a large n). The Poisson limit
theorem tells us that in this case the probability distribution concerning spike emission is well
matched by a Poisson distribution.
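
The convergence stated by the Poisson limit theorem can be inspected numerically. The sketch below compares the binomial pmf B(n, p) with the Poisson pmf for fixed $\lambda = np$ and growing n; the helper `max_abs_difference` and the cut-off `k_max` are assumptions made for this illustration only.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Probability of k events under a Poisson distribution with parameter lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def max_abs_difference(n, lam, k_max=20):
    """Largest pointwise gap between B(n, lam/n) and Poiss(lam) for k = 0..k_max."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))

# Keep lam = n * p fixed while the number of bins n grows (and p shrinks).
for n in (10, 100, 1000):
    print(n, max_abs_difference(n, lam=2.0))
```

The printed gap shrinks as n grows, which is the behaviour one expects when many short time bins each carry a small spike probability.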

See Figure 6 for examples of the pmf and cdf for Poisson-distributed random variables for a
selection of parameters $\lambda$.

The so-called exponential distribution Exp($\lambda$) is a continuous probability distribution parametrized
by one real parameter $\lambda \in \mathbb{R}^+$. Its probability density function is given by

$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0, \\ 0 & \text{for } x < 0. \end{cases}$$

The exponential distribution with parameter $\lambda$ can be interpreted as the probability distribution
describing the time between two events in a Poisson process with parameter $\lambda$, see the next
section.

See Figure 7 for examples of the pdf and cdf for exponentially distributed random variables
for a selection of parameters $\lambda$.

We want to conclude our overview of families of probability distributions at this point and
point the interested reader to [171 IHHl [S2] regarding further examples and details of families of 
probability distributions. 

3.8 Stochastic Processes 

A stochastic process (sometimes also called random process) is a collection of random variables 
indexed by a totally ordered set, which is usually taken as time. Stochastic processes are commonly 
used to model the evolution of some random variable over time. We will only look at discrete-time 
processes in the following, i.e. stochastic processes that are indexed by a discrete set. The 
extension to the continuous case is straightforward, see [18] for an introduction to the subject.
Mathematically, a stochastic process is defined as follows. 

Definition 3.14 Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $(S, \mathcal{S})$ be a measure space. Let
furthermore $X_t : \Omega \to S$ be a set of random variables, where $t \in T$. Then an S-valued stochastic
process $\mathcal{P}$ is given by

$$\mathcal{P} := \{X_t : t \in T\},$$

where T is some totally ordered set, commonly interpreted as time. The space S is referred to
as the sample space of the process $\mathcal{P}$.

Figure 5: Binomial distribution: probability mass function (a) and cumulative distribution function
(b) for selected parameter values of p and n. Circle: p = 0.2, n = 20, triangle: p = 0.5, n = 20,
square: p = 0.7, n = 40.

Figure 7: Exponential distribution: probability density function (a) and cumulative distribution
function (b) for selected parameter values of $\lambda$.

If the distribution underlying the random variables $X_t$ does not vary over time, the process is
called homogeneous; in the case where the probability distributions $P_{X_t}$ depend on the time t, it
is called inhomogeneous.

A special and well-studied type of stochastic process is the so-called Markov process.
A discrete Markov process of order $k \in \mathbb{N}$ is an inhomogeneous stochastic process subject to the
restriction that for any time $t = 0, 1, \ldots$, the probability distribution underlying $X_t$ only depends
on the values of the k preceding random variables $X_{t-1}, \ldots, X_{t-k}$, i.e. that for any t and any set of
realizations $x_i$ of $X_i$ ($0 \le i \le t$) we have

$$P(X_t = x_t \mid X_{t-1} = x_{t-1}, \ldots, X_{t-k} = x_{t-k}) = P(X_t = x_t \mid X_{t-1} = x_{t-1}, \ldots, X_0 = x_0).$$

Another process often considered in neuroscientific applications is the Poisson process. It is
a continuous-time stochastic process $\mathcal{P}$ for which the random variables are Poisson-distributed
with some parameter $\lambda(t)$ (in the inhomogeneous case; for the homogeneous case we have
$\lambda(t) = \lambda$ = constant). As can be shown, the time delay between each pair of consecutive events
of a homogeneous Poisson process is exponentially distributed. See Figure 8 for examples of the number of
instantaneous (occurring during one time slice) and the number of cumulated events (over all
preceding time slices) for Poisson processes for a selection of parameters $\lambda$.

Poisson processes have proven to be a good model for many natural as well as man-made 
processes such as radioactive decay, telephone calls and queues, and also for modeling neural data. 
An influential paper in the neurosciences was [23], showing the random nature of the closing 
and opening of single ion channels in certain neurons. Using a Poisson process with the right 
parameter provides a good fit to the measured data here. 

Another prominent example of neuroscientific models employing a Poisson process is the 
commonly used model for the sparse and highly irregular firing patterns of cortical neurons in 
vivo [1011 [Ml jTlj . The firing patterns of such cells are usually modeled using inhomogeneous 
Poisson processes (with $\lambda(t)$ modeling the average firing rate of a cell).
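
A homogeneous Poisson spike train can be sketched by exploiting the exponentially distributed waiting times mentioned above: one draws successive inter-event intervals from an exponential distribution with parameter $\lambda$ and accumulates them. The function below is a minimal illustration under this assumption and not a model of any particular data set; its name and parameters are hypothetical.

```python
import random

def poisson_process_event_times(rate, duration, seed=0):
    """Event times of a homogeneous Poisson process with the given rate on [0, duration].

    Successive waiting times are drawn from an exponential distribution
    with parameter `rate` and summed up.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # exponentially distributed waiting time
        if t > duration:
            return times
        times.append(t)

spikes = poisson_process_event_times(rate=5.0, duration=2000.0)
print(len(spikes) / 2000.0)          # empirical rate, expected to be close to 5
```

For a long enough duration the empirical rate (number of events divided by the duration) should lie close to the chosen rate, and the mean inter-event interval close to its reciprocal.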

4 Information Theory 

Information theory was introduced by Shannon [97] as a mathematically rigorous theory to describe
the process of transmission of information over some channel of communication. His goal was
to quantitatively measure the "information content" of a "message" sent over some "channel", see
Figure 10. In what follows we will not go into detail regarding all aspects of Shannon's theory,
but we will mainly focus on his idea of measuring the "information content" of a message. For a more
in-depth treatment of the subject, the interested reader is pointed to the excellent book [26].



The central elements of Shannon's theory are depicted in Figure 10. In the standard setting
considered in information theory, an information source produces messages that are subsequently 
encoded using symbols from an alphabet and sent over a noisy channel to be received by a receiver 
that decodes the message and attempts to reconstruct the original message. 

A communication channel (or just channel) in Shannon's model transmits the encoded message 
from the sender to the receiver. Due to noise present in the channel the receiver does not receive 
the original message dispatched by the sender but rather some noisy version of it.

Figure 8: Examples of the number of events in one time window of size $\Delta t = 1$ (a) and the number
of accumulated events since $t = 0$ (b) for Poisson processes with rates $\lambda = 1$ (circle), $\lambda = 5$ (triangle)
and $\lambda = 10$ (square).

Figure 10: The setting of Shannon's information theory: information is transferred from a source
to a destination via a message that is first encoded, and then subsequently sent over a noisy
channel to be decoded by the receiver.

The whole theory is set in the field of probability theory (hence our introduction to the 
concepts in the last section) and in this context, the messages emitted by the source are modeled 
as a random variable X with some underlying probability distribution $P_X$. For each message
x (a realization of X), the receiver sees a corrupted version y of x, and this fact is modeled by
interpreting the received messages as realizations of a random variable Y with some probability
distribution $P_Y$ (that depends both on $P_X$ and the channel properties). The transmission
characteristics of the channel itself are characterized by the stochastic correspondence of the
signals transmitted by the sender to the ones received by the receiver, i.e. by modeling the channel
as a conditional probability distribution $P_{Y|X}$.

As the theory is based upon probability theory, keep in mind that all the information-theoretic
quantities that we will look at in the following such as "entropy" or "mutual information" are
just properties of the random variables involved, i.e. properties of the probability distributions
underlying these random variables.

Information-theoretic analyses have proven to be a valuable tool in many areas of science 
such as physics, biology, chemistry, finance and linguistics and generally in the study of complex 
systems [HI E2 ■ We will have a look at applications in the neurosciences in Section [6] 

Note that a vast number of works has been published in the field of information theory and its
applications since its first presentation in the late 1940s. We will focus on the core concepts in the
following and point the reader to [26] for a more in-depth treatment of the subject.

In the following we will start by looking at a notion of information and, using this, proceed to
define entropy (sometimes also called Shannon entropy), a core concept in information theory. As
all further information-theoretic concepts are based on the idea of entropy, it is of vital importance
to understand this concept well. We will then look at mutual information, the information shared
by two or more random variables. Furthermore, we will look at a measure of distance for
probability distributions called Kullback-Leibler divergence and give an interpretation of mutual
information in terms of Kullback-Leibler divergence. After a quick look at the multivariate
case of mutual information between more than two variables and the relation between mutual
information and channel capacity we will then proceed to an information-theoretic measure called
transfer entropy. Transfer entropy is based on mutual information but in contrast to mutual
information is of directed nature.

4.1 A Notion of Information 

Before defining entropy, let us try to give an axiomatic definition of the concept of "information".
The entropy of a random variable will then be nothing more than the expected (i.e. average)
amount of information contained in a realization of that random variable.

We want to consider a probabilistic model in what follows, i.e. we have a set of events, each
occurring with a given probability. The goal is to assess how informative the occurrence of a given
event is. What would we intuitively expect from a measure of information h that maps the set of
events to the set of non-negative real numbers?

Figure 11: The logarithm to the base 2.

First of all, it should certainly be additive for independent events and sub-additive for non- 
independent events. This is easily justified: If you read two newspaper articles about totally 
unrelated subjects, the total amount of information you obtain consists of both the information 
in the first and the second article. When you read articles about related subjects, they often have 
some common information. 

Furthermore, events that occur regularly and unsurprisingly are not considered informative,
and the rarer or more surprising an event is, the more informative it is. Think about an
article reporting that your favorite sports team, which usually wins all its matches, has won another match. You will
consider this not very informative. But when the local newspaper reports about an earthquake
with its epicenter in the part of town where you live, this will certainly be informative to you
(unless you were at home during the time the earthquake happened).

We thus have the following axioms for the information content h of an event, where we look
at the information content of events contained in some probability space $(\Omega, \Sigma, P)$.

(i) h is non-negative: $h : \Sigma \to \mathbb{R}^+$.

(ii) h is sub-additive: For any two messages $\omega_1, \omega_2 \in \Sigma$ we have $h(\omega_1 \cap \omega_2) \le h(\omega_1) + h(\omega_2)$,
where equality holds if and only if $\omega_1$ and $\omega_2$ are independent.

(iii) h is continuous and monotonic with respect to the probability measure P.

(iv) Events with probability 1 are not informative: $h(\omega) = 0$ for $\omega \in \Sigma$ with $P(\omega) = 1$.

Now calculus tells us (this is not hard to show; you paid attention in the mathematics class
at school, didn't you?) that these four requirements leave essentially only one possible function that fulfills
all of them: the negative logarithm of the probability. This leads us to the following natural definition.

Definition 4.1 (Information) Let $(\Omega, \Sigma, P)$ be a probability space. Then the information h of
an event $\sigma \in \Sigma$ is defined as

$$h(\sigma) := h(P(\sigma)) = -\log_b(P(\sigma)),$$

where b denotes the base of the logarithm.

For the base of the logarithm, usually b = 2 or b = e is chosen, fixing the unit of h as "bit"
or "nat", respectively. We resort to using b = 2 for the rest of this chapter and write log for the
logarithm to the base two. The natural logarithm will be denoted by ln.

Note that the information content in our definition only depends on the probability of the
occurrence of the event and not the event itself. It is thus a property of the probability distribution
P.

Let us give some examples in order to illustrate this idea of information content. 

Consider a toss of a fair coin, where the possible outcomes are heads (H) or tails (T), each
occurring with probability $\frac{1}{2}$. What is the information contained in a coin toss? As the information
solely depends on the probability, we have h(H) = h(T), which comes as no surprise. Furthermore
we have $h(H) = h(T) = -\log \frac{1}{2} = -(\log(1) - \log(2)) = \log 2 = 1$ bit, when we apply the fundamental
logarithmic identity $\log(a / b) = \log(a) - \log(b)$. Thus one toss of a fair coin gives us one bit of
information. This fact also lets us explain the unit attached to h. If measured in bit (i.e. with
b = 2), this is the number of bits needed to store that information. For the toss of a coin we need
one bit, assigning each outcome to either 0 or 1.

Repeating the same game for the roll of a fair die where each digit has probability $\frac{1}{6}$, we again
have the same amount of information for each digit $E \in \{1, \ldots, 6\}$, namely $h(E) = \log(6) \approx 2.58$
bit. This means that in this case we need 3 bits to store the information associated with each
outcome, namely the number shown.
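
The information content of these events is easily evaluated in code. The following sketch implements Definition 4.1 directly; the function name `information_content` and the keyword `base` are our own choices for this illustration.

```python
import math

def information_content(p, base=2):
    """Information h = -log_b(p) of an event with probability p (0 < p <= 1)."""
    return -math.log(p, base)

print(information_content(1 / 2))   # fair coin: one bit
print(information_content(1 / 6))   # fair die: log2(6) bits
print(math.ceil(information_content(1 / 6)))  # whole bits needed to store one die outcome
```

Because the logarithm turns products into sums, the information of two independent events (whose joint probability is the product of the single probabilities) is the sum of the individual information contents, as demanded by the additivity axiom.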

Looking at the two examples above, we can give another (hopefully intuitive) characterization 
of the term information content: It is the minimal number of yes-no-questions that we have to 
ask until we know which event occurred, assuming that we have a knowledge of the underlying 
probability distribution. Consider the example of the coin toss above. We have to ask exactly 
one question and we know the outcome ("Was it heads?" or "Was it tails?").

Things get more interesting when we look at the case of the die throw. Here several question-asking
strategies are possible and you can freely choose your favorite; we will give one example
below.

Say a digit d was thrown. The first question could be "Was the digit less than or equal to 3?" (other
strategies: "Was the digit greater than or equal to 3?", "Was the digit even?", "Was the digit odd?").
We then go on depending on the answer and cut off roughly half of the remaining probability
mass in each step, leaving us with a single possibility after at most 3 steps. From the information
content we know that on average we have to ask at least about 2.58 questions.
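
To see how such a strategy performs, one can simply count the questions it needs for each possible outcome. The sketch below implements one halving strategy of the kind described above (the function `questions_needed` and the way the candidates are split are assumptions made for this illustration) and averages over the six equally likely digits.

```python
import math
from fractions import Fraction

def questions_needed(candidates, target):
    """Count yes/no questions 'is it in the lower part?' until target is identified."""
    count = 0
    while len(candidates) > 1:
        half = len(candidates) // 2              # size of the lower part
        lower, upper = candidates[:half], candidates[half:]
        candidates = lower if target in lower else upper
        count += 1
    return count

digits = list(range(1, 7))
counts = [questions_needed(digits, d) for d in digits]
average = Fraction(sum(counts), len(counts))     # all digits are equally likely

print(counts, average, math.log2(6))
```

For this particular strategy two of the six digits are identified after 2 questions and the remaining four after 3, giving an average of 8/3, i.e. about 2.67 questions. This is slightly more than log 6, which is about 2.58; for single throws the information content acts as a lower bound on the average number of questions.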

The two examples above were both cases with uniform probability distributions but in principle 
the same applies to arbitrary probability distributions. 

4.2 Entropy as Expected Information Content 

The term entropy is at the heart of Shannon's information theory [97]. Using the notion of
information as discussed in Section 4.1, we can readily define the entropy of a discrete random
variable as its expected information.

Definition 4.2 (entropy) Let X be a random variable on some probability space $(\Omega, \Sigma, P)$ with
values in the integers or the real numbers. Then its entropy¹ (sometimes also called Shannon
entropy or self-information) H(X) is defined as the expected amount of information of X,

$$H(X) := E[h(X)]. \qquad (4.1)$$

If X is a random variable that takes integer values (i.e. a discrete random variable), Equation 4.1
evaluates to

$$H(X) = \sum_x P(X = x)\, h(P(X = x)) = -\sum_x P(X = x) \log(P(X = x)).$$

In the case of a real-valued, continuous random variable we get

$$H(X) = \int_{\mathbb{R}} P(X = x)\, h(P(X = x))\, dx$$

(where P(X = x) is to be read as the probability density of X at x) and the resulting quantity is called differential entropy [26].

¹Shannon chose the letter H for denoting entropy after Boltzmann's H-theorem in classical statistical mechanics.

As the information content is a function solely dependent on the probability of the events one 
also speaks of the entropy of a probability distribution. 

Looking at the definition in Equation 4.1, we see that entropy is a measure for the average
amount of information that we expect to obtain when looking at realizations of a given random
variable X. An equivalent characterization would be to interpret it as the average information
one is missing when one does not know the value of the random variable (i.e. its realization), and
a third one would be to interpret it as the average reduction of uncertainty about the possible
values of a random variable after having observed a realization.

Akin to the information content h, entropy H is a dimensionless number and usually measured
in bits (i.e. the expected number of binary digits needed to store the information) by taking the
logarithm to the base 2.

Shannon entropy has many applications as we will see in the following and constitutes the 
core of all things labeled "information theory" . Let us thus look a bit closer at this quantity. 

Lemma 4.3 Let X be some discrete random variable taking at most n different values. Then its entropy H(X) satisfies the two
inequalities

$$0 \le H(X) \le \log(n).$$

Note that the first inequality is a direct consequence of the properties of the information
content and the second follows from Gibbs' inequality [26].

With regard to entropy, probability distributions having maximal entropy are often of interest 
in applications as they can be seen as the least restricted ones (i.e. having the least a priori 
assumptions), given the model parameters. The principle of maximum entropy states that when 
choosing among a set of probability distributions with certain fixed properties, the preference 
should be given to distributions that have the maximal entropy among all considered distributions. 
This choice is justified as the one making the fewest assumptions on the shape of the distribution 
apart from the properties fixed before choosing. 

For discrete probability distributions, the uniform distribution is the one with the highest
entropy among all distributions on the same base set. This can be seen well in the example
in Figure 12, where the entropy takes its maximum at p = 1/2, corresponding to the uniform
probability distribution on the two elements 0 and 1, each occurring with probability 1/2.

For continuous, real-valued random variables with a given finite mean $\mu$ and variance $\sigma^2$, the
normal distribution with mean $\mu$ and variance $\sigma^2$ has the highest entropy. For non-negative
real-valued random variables for which only the mean $\mu$ is fixed, the exponential distribution with parameter $\lambda = 1/\mu$ is the
maximum-entropy distribution.

Examples 

Before continuing, let us now compute some more entropies in order to get a feeling for this 
quantity. 

For a uniform probability distribution P on n events $\Omega = \{\omega_1, \ldots, \omega_n\}$ each event has probability
$P(\omega_i) = 1/n$ and we obtain

$$H(P) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n,$$

as the maximal entropy for all discrete probability distributions on the set $\Omega$.

Figure 12: Entropy H(X) of a Bernoulli random variable X as a function of success probability
p = P(X = 1). The maximum is attained at p = 1/2.

Let us now compute the entropy of a Bernoulli random variable X, i.e. a binary random
variable X taking the values 1 and 0 with probability p and 1 - p, respectively. For the entropy of X
we get

$$H(X) = -(p \log p + (1 - p) \log(1 - p)).$$

See Figure 12 for a plot of the entropy seen as a function of the success probability p.
As expected, the maximum is attained at p = 1/2, corresponding to the case of the uniform
distribution.

Computing the differential entropy of a normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance
$\sigma^2$ yields

$$H(N(\mu, \sigma^2)) = \frac{1}{2} \log(2 \pi e \sigma^2),$$

and we see that the entropy does not depend on the mean value of the distribution but just on
its variance. This is not surprising, as the shape of the probability distribution is only changed
by $\sigma^2$ and not by $\mu$.
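
The entropies computed above can be reproduced numerically. The sketch below implements the discrete entropy of Equation 4.1 for a list of probabilities and the closed-form differential entropy of a normal distribution; the function names are our own, and terms with probability zero are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli random variable with success probability p."""
    return entropy([p, 1 - p])

def normal_differential_entropy(variance):
    """Differential entropy in bits of a normal distribution; independent of the mean."""
    return 0.5 * math.log2(2 * math.pi * math.e * variance)

print(entropy([1 / 6] * 6))              # uniform distribution on 6 events: log2(6)
print(bernoulli_entropy(0.5))            # maximal value of 1 bit
print(bernoulli_entropy(0.9))            # less than 1 bit
print(normal_differential_entropy(1.0))
```

Evaluating `bernoulli_entropy` on a grid of values of p between 0 and 1 traces out the curve of Figure 12.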

For an example of how to compute the entropy of spike trains see Section 6.

Joint Entropy 

Generalizing the notion of entropy to two or more variables we can define the so-called joint
entropy to quantify the expected uncertainty (or expected information) in a joint distribution of
random variables.

Definition 4.4 (joint entropy) Let X and Y be discrete random variables on some probability
spaces. Then the joint entropy of X and Y is given by

$$H(X, Y) = -E_{X,Y}[\log P(x, y)] = -\sum_{x, y} P(x, y) \log P(x, y), \qquad (4.2)$$

where $P_{X,Y}$ denotes the joint probability distribution of X and Y and the sum runs over all
possible values x and y of X and Y, respectively.

This definition allows a straightforward extension to the case of more than two random 
variables. 

The conditional entropy H(X|Y) of two random variables X and Y quantifies the expected
uncertainty (respectively expected information) remaining in a random variable X under the
condition that a second variable Y was observed, i.e. the part of the expected
uncertainty in X that is not removed by the knowledge of Y.

Definition 4.5 (conditional entropy) Let X and Y be discrete random variables on some
probability spaces. Then the conditional entropy of X given Y is given by

$$H(X \mid Y) = -E_{X,Y}[\log P(x \mid y)] = -\sum_{x, y} P(x, y) \log P(x \mid y),$$

where $P_{X,Y}$ denotes the joint probability distribution of X and Y.
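
Both definitions translate directly into code once the joint distribution is given as a table. The following sketch stores the joint distribution as a dictionary mapping pairs (x, y) to P(x, y), a representation chosen only for this illustration, and evaluates H(X, Y) and H(X|Y) for two hypothetical example distributions: two independent fair coins, and a coin Y together with an exact copy X of it.

```python
import math
from collections import defaultdict

def joint_entropy(joint):
    """H(X, Y) in bits for joint = {(x, y): P(x, y)}."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def marginal_y(joint):
    """Marginal distribution P(y), obtained by summing P(x, y) over x."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    return dict(p_y)

def conditional_entropy(joint):
    """H(X|Y) in bits, using P(x|y) = P(x, y) / P(y)."""
    p_y = marginal_y(joint)
    return -sum(p * math.log2(p / p_y[y]) for (_, y), p in joint.items() if p > 0)

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}  # two independent fair coins
copied = {(0, 0): 0.5, (1, 1): 0.5}                           # X is an exact copy of Y

print(joint_entropy(independent), conditional_entropy(independent))
print(joint_entropy(copied), conditional_entropy(copied))
```

For the independent coins, observing Y leaves the full bit of uncertainty in X, whereas for the copied coin no uncertainty remains. In both cases the values are consistent with the standard chain rule H(X, Y) = H(Y) + H(X|Y), an identity that is not stated above but follows from the two definitions.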

4.3 Mutual Information 

In this section we will introduce the notion of mutual information, an entropy-based measure 
for the information shared between two (or more) random variables. Mutual information can be 
thought of as a measure for the mutual dependence of random variables, i.e. as a measure for 
how far they are from being independent. 

We will give two different approaches to this concept in the following: a direct one based on the
point-wise mutual information i and one using the idea of conditional entropy. Note that in essence,
these are just different approaches to defining the same object; we give both,
hoping that they help in understanding the concept better. In Section 4.4 we will
see yet another characterization in terms of the Kullback-Leibler divergence.

4.3.1 Point-wise Mutual Information 

In terms of information content, the case of considering two events that are independent is 
straightforward: One of the axioms tells us that the information content of the two events 
occurring together is the sum of the information contents of the single events. But what about the
case where the events are not independent? In this case we certainly have to consider the conditional
probabilities of the two events occurring: If one event often occurs given that the other one 
occurs (think of the two events "It is snowing." and "It is winter"), the information overlap is 
higher than when the occurrence of one given the other is rare (think of "It is snowing" and "It 
is summer."). 

Using the notion of information from Section 4.1, let us express this in a mathematical way
by defining the mutual information (i.e. shared information content) of two events. We call this
the point-wise mutual information or pmi.

Definition 4.6 (point-wise mutual information) Let x and y be two events of a probability
space $(\Omega, \Sigma, P)$. Then their point-wise mutual information (pmi) is given as

$$ i(x;y) := \log \frac{P(x,y)}{P(x)P(y)} = \log \frac{P(x|y)}{P(x)} = \log \frac{P(y|x)}{P(y)}. \tag{4.3} $$
Note that we used the joint probability distribution of x and y for the definition of $i(x;y)$ to
avoid the ambiguities introduced by the conditional distributions. Yet, the latter are probably
the easier way to gain a first understanding of this quantity.

Let us note that this measure of shared information is symmetric ($i(x;y) = i(y;x)$) and that it
can take positive as well as negative values. Such negative values of point-wise mutual
information are commonly referred to as misinformation [64]. Point-wise mutual information
is zero if the two events x and y are independent and it is bounded above by the information
content of x and y. More generally, the following inequality holds:

$$ -\infty \le i(x;y) \le \min\{\underbrace{-\log P(x)}_{=h(x)},\ \underbrace{-\log P(y)}_{=h(y)}\}. $$

Defining the information content of the co-occurrence of x and y as

$$ i(x,y) := -\log P(x,y), $$

another way of writing the point-wise mutual information is

$$ \begin{aligned} i(x;y) &= i(x) + i(y) - i(x,y) \\ &= i(x) - i(x|y) \\ &= i(y) - i(y|x), \end{aligned} \tag{4.4} $$

where the first identity above is readily obtained from Equation 4.3 by just expanding the
logarithmic term and in the second and third line the formula for the conditional probability was
used.

Before considering mutual information of random variables as a straightforward generalization 
of the above, let us look at an example. 

Say we have two probability spaces $(\Omega_a, \Sigma_a, P_a)$ and $(\Omega_b, \Sigma_b, P_b)$, with
$\Omega_a = \{a_1, a_2\}$ and $\Omega_b = \{b_1, b_2\}$. We want to compute the point-wise mutual
information of two events $\omega_a \in \Omega_a$ and $\omega_b \in \Omega_b$, subject to the joint
probability distribution of $\omega_a$ and $\omega_b$ as given in Table 1. Note that
the joint probability distribution can also be written as the matrix

$$ P(\omega_a, \omega_b) = \begin{pmatrix} 0.2 & 0.5 \\ 0.25 & 0.05 \end{pmatrix} $$

if we label rows by possible outcomes of $\omega_a$ and columns by possible outcomes of $\omega_b$. The
marginal distributions $P(\omega_a)$ and $P(\omega_b)$ are now obtained as row sums and column sums,
respectively: $P(\omega_a = a_1) = 0.7$, $P(\omega_a = a_2) = 0.3$, $P(\omega_b = b_1) = 0.45$,
$P(\omega_b = b_2) = 0.55$.

We can now calculate the point-wise mutual information of, for example,

$$ i(a_2; b_2) = \log \frac{0.05}{0.3 \cdot 0.55} \approx -1.7 \text{ bits} $$

and

$$ i(a_1; b_1) = \log \frac{0.2}{0.7 \cdot 0.45} \approx -0.65 \text{ bits}. $$

    $\omega_a$    $\omega_b$    $P(x,y)$
    $a_1$         $b_1$         0.2
    $a_1$         $b_2$         0.5
    $a_2$         $b_1$         0.25
    $a_2$         $b_2$         0.05

Table 1: Table of joint probabilities $P(\omega_a, \omega_b)$ of two events $\omega_a$ and $\omega_b$.



Note again that in contrast to mutual information (that we will discuss in the next section),
point-wise mutual information can take negative values, called misinformation, see [64].
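
The calculation for Table 1 can be retraced with a few lines of code. The following is a minimal sketch; the variable names and the helper `pmi` are our own, the numbers are those of Table 1, and base-2 logarithms are used so that the values are in bits.

```python
import math

# Joint probabilities from Table 1: rows are a1, a2; columns are b1, b2.
joint = [[0.20, 0.50],
         [0.25, 0.05]]

p_a = [sum(row) for row in joint]                       # row sums: P(a1), P(a2)
p_b = [sum(row[j] for row in joint) for j in range(2)]  # column sums: P(b1), P(b2)

def pmi(i, j):
    """Point-wise mutual information of the events with row index i and column index j."""
    return math.log2(joint[i][j] / (p_a[i] * p_b[j]))

pmi_a2_b2 = pmi(1, 1)  # about -1.72 bits
pmi_a1_b1 = pmi(0, 0)  # about -0.66 bits
pmi_a1_b2 = pmi(0, 1)  # about +0.38 bits
```

The pair (a1, b2) occurs more often than expected under independence and therefore has positive point-wise mutual information, whereas the other two pairs shown occur less often than expected and have negative values.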

4.3.2 Mutual Information as Expected Point-wise Mutual Information 

Using point-wise mutual information, the definition of mutual information of two random variables 
is straightforward: Mutual information of two random variables is the expected value of the 
point-wise mutual information of all realizations. 

Definition 4.7 (mutual information) Let X and Y be two discrete random variables. Then
the mutual information $I(X;Y)$ is given as the expected point-wise mutual information,

$$ \begin{aligned} I(X;Y) &:= E_{X,Y}[i(x;y)] \\ &= \sum_y \sum_x P(x,y)\, i(x;y) \\ &= \sum_y \sum_x P(x,y) \log\left(\frac{P(x,y)}{P(x)P(y)}\right), \end{aligned} \tag{4.5} $$

where the sums are taken over all possible values x of X and y of Y.

Remember again that the joint probability $P(x,y)$ is just a two-dimensional matrix where
the rows are indexed by X-values and the columns by Y-values and that each row (column) tells
us how likely each possible value of Y (X) is to occur together with the value x of X (y of Y)
determined by the row (column) index. The row (column) sums yield the marginal probability
distribution $P(x)$ ($P(y)$), which can be written as a vector.

If X and Y are continuous random variables we just replace the sums by integrals and obtain
what is known as differential mutual information:

$$ I(X;Y) = \int_Y \int_X P(x,y) \log\left(\frac{P(x,y)}{P(x)P(y)}\right) dx\, dy. \tag{4.6} $$

Here $P(x,y)$ denotes the joint probability density function of X and Y, and $P(x)$ and
$P(y)$ the marginal probability density functions of X and Y, respectively.

As we can see, mutual information can be interpreted as the information (i.e. entropy) shared
by the two variables, hence its name. Like point-wise mutual information it is a symmetric
quantity, $I(X;Y) = I(Y;X)$, and it is non-negative, $I(X;Y) \ge 0$. Note though that it is not
a metric, as in the general case it does not satisfy the triangle inequality. Furthermore we
have $I(X;X) = H(X)$ and this identity is the reason why entropy is sometimes also called
self-information.

Figure 13: Venn diagram showing the relation between the entropies $H(X)$ and $H(Y)$, the
joint entropy $H(X,Y)$, the conditional entropies $H(X|Y)$ and $H(Y|X)$, and mutual information
$I(X;Y)$.

Taking the expected value of Equation 4.3 and using the notion of conditional entropy we can
define mutual information between two random variables as follows:

$$ \begin{aligned} I(X;Y) &:= H(X) + H(Y) - H(X,Y) \\ &= H(X) - H(X|Y) \\ &= H(Y) - H(Y|X), \end{aligned} \tag{4.7} $$

where in the last two steps the identity $H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$ was
used. Note that Equation 4.7 is the generalization of Equation 4.4 to the case of random variables.
See Figure 13 for an illustration of the relation between the different entropies and mutual
information.

A possible interpretation of mutual information of two random variables X and Y is to 
consider it as a measure for the shared entropy between the two variables. 
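
That the two approaches indeed describe the same quantity can be checked numerically. The following minimal sketch computes the mutual information of the joint distribution of Table 1 once as expected point-wise mutual information and once via entropies; the helper `entropy` and all variable names are our own choices, and values are in bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = [[0.20, 0.50],
         [0.25, 0.05]]                      # joint distribution from Table 1
p_x = [sum(row) for row in joint]           # marginal of the row variable
p_y = [sum(col) for col in zip(*joint)]     # marginal of the column variable

# Route 1: expected point-wise mutual information.
mi_pmi = sum(p * math.log2(p / (p_x[i] * p_y[j]))
             for i, row in enumerate(joint)
             for j, p in enumerate(row) if p > 0)

# Route 2: I(X;Y) = H(X) + H(Y) - H(X,Y).
h_joint = entropy([p for row in joint for p in row])
mi_entropy = entropy(p_x) + entropy(p_y) - h_joint
```

Both routes give about 0.19 bits. Note that the result is positive although several of the individual point-wise values are negative.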

4.3.3 Mutual Information and Channel Capacities 

We will look at channels in Shannon's sense of communication in the following and relate mutual
information to channel capacity. But rather than looking at the subject in its full generality, we
restrict ourselves to discrete, memoryless channels. The interested reader is pointed to [26] for a
more thorough treatment of the subject.

Let us take as usual X and Y for the signal transmitted by some sender and received by some
receiver, respectively. In terms of information transmission we can interpret mutual information
$I(X;Y)$ as the average amount of information the received signal contains about the transmitted
signal, where the averaging is done over the probability distribution $P_X$ of the source signal. This
makes mutual information a function of $P_X$ and $P_{Y|X}$ and, as we know, it is a symmetric quantity.

Shannon defines the capacity C of some channel as the maximum amount of information
that a signal Y received by the receiver can contain about the signal X transmitted through the
channel by the source.

In terms of mutual information we can define the channel capacity as the maximum of
$I(X;Y)$ over all possible probability distributions $P_X$ of the transmitted signal X. Channel
capacity is thus not dependent on the distribution $P_X$ of X but rather a property of the channel
itself, i.e. a property of the conditional distribution $P_{Y|X}$, and as such asymmetric and causal [112] [55].

Note that channel capacity is bounded from below by 0 and from above by the entropy $H(X)$
of X, with the maximal capacity being attained by a noise-free channel. In the presence of noise
the capacity is lower.
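
As an illustration, consider a binary symmetric channel, i.e. a channel that transmits single bits and flips each bit with a fixed probability. This channel and the flip probability of 0.1 are our own example, not taken from the text above. The sketch below approximates the capacity by a simple grid search over the input distributions, using $I(X;Y) = H(Y) - H(Y|X)$.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information_bsc(p_one, flip):
    """I(X;Y) for an input with P(X=1) = p_one sent through a binary symmetric channel."""
    p_y_one = p_one * (1 - flip) + (1 - p_one) * flip   # P(Y=1)
    return h2(p_y_one) - h2(flip)                       # H(Y) - H(Y|X)

flip = 0.1                                   # hypothetical bit-flip probability
grid = [i / 1000 for i in range(1001)]       # candidate input distributions P(X=1)
best_input = max(grid, key=lambda p: mutual_information_bsc(p, flip))
capacity = mutual_information_bsc(best_input, flip)
```

The maximum is attained for the uniform input distribution and equals 1 - h2(0.1), i.e. roughly 0.53 bits per transmitted bit; for a noise-free channel (flip = 0) the same search yields 1 bit.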

We will have a look at channels again when dealing with applications of the theory in Section 6.

4.3.4 Normalized Measures of Mutual Information 

In many applications one is interested in making values of mutual information comparable by
employing a suitable normalization. Consequently, there exists a variety of proposed normalized
measures of mutual information, most based on the simple idea of normalizing by one of the
entropies that appear in the upper bounds of the mutual information. Using the entropy of one
variable as a normalization factor, there are two possible choices and both were proposed: the
so-called coefficient of constraint $C(X|Y)$ [25],

$$ C(X|Y) := \frac{I(X;Y)}{H(Y)}, $$

and the uncertainty coefficient $U(X|Y)$ [105],

$$ U(X|Y) := \frac{I(X;Y)}{H(X)}. $$

These two quantities are obviously non-symmetric but can easily be symmetrized, for example
by setting

$$ U(X,Y) := \frac{H(X)\,U(X|Y) + H(Y)\,U(Y|X)}{H(X) + H(Y)}. $$

Another symmetric normalized measure for mutual information, usually referred to as redundancy
measure, is obtained when normalizing by the sum of the entropies of the variables:

$$ R := \frac{I(X;Y)}{H(X) + H(Y)}. $$

Note that R takes its minimum of 0 when the two variables are independent and its maximum
when one variable is completely redundant knowing the other.

Note that the list of normalized variants of mutual information given here is far from complete. 
But as said earlier, the principle behind most normalizations is to use one or a combination of 
the entropies of the involved random variables as a normalizing factor. 
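
The following sketch evaluates some of these normalized measures for the joint distribution of Table 1. It assumes that the coefficient of constraint normalizes by H(Y) and the uncertainty coefficient by H(X), and that the redundancy measure is the mutual information divided by H(X) + H(Y); all variable names are our own.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = [[0.20, 0.50],
         [0.25, 0.05]]                                # joint distribution from Table 1
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
mi = h_x + h_y - entropy([p for row in joint for p in row])

coeff_constraint = mi / h_y      # normalization by H(Y) (assumed form of C(X|Y))
uncertainty_coeff = mi / h_x     # normalization by H(X), i.e. U(X|Y)
redundancy = mi / (h_x + h_y)    # normalization by the sum of the entropies
# Entropy-weighted average of the two one-sided uncertainty coefficients.
symmetric_u = (h_x * (mi / h_x) + h_y * (mi / h_y)) / (h_x + h_y)
```

With these definitions the symmetrized uncertainty coefficient is exactly twice the redundancy measure, so the two symmetric variants differ only by a constant factor.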

4.3.5 Multivariate Case 

What if we want to calculate the mutual information not only between two random
variables but rather between three or more? A natural generalization of mutual information to this
so-called multivariate case is given by the following definition using conditional entropies and is also
called multi-information or integration [107].

The mutual information for three random variables $X_1, X_2, X_3$ is given by

$$ I(X_1;X_2;X_3) := I(X_1;X_2) - I(X_1;X_2|X_3), $$

where the last term is defined as

$$ I(X_1;X_2|X_3) := E_{X_3}\big(I(X_1;X_2)\,|\,X_3\big), $$

the latter being called the conditional mutual information of $X_1$ and $X_2$ given $X_3$. The
conditional mutual information $I(X_1;X_2|X_3)$ can also be interpreted as the average common
information shared by $X_1$ and $X_2$ that is not already contained in $X_3$.

Inductively, the generalization to the case of n random variables $X_1, \ldots, X_n$ is straightforward:

$$ I(X_1;\ldots;X_n) := I(X_1;\ldots;X_{n-1}) - I(X_1;\ldots;X_{n-1}|X_n), $$

where the last term is again the conditional mutual information

$$ I(X_1;\ldots;X_{n-1}|X_n) := E_{X_n}\big(I(X_1;\ldots;X_{n-1})\,|\,X_n\big). $$

Beware that while the interpretations of mutual information directly generalize from the
bivariate case $I(X;Y)$ to the multivariate case $I(X_1;\ldots;X_n)$, there is an important difference
between the bivariate and the multivariate measure. Whereas mutual information $I(X;Y)$ is a
non-negative quantity, multivariate mutual information (MMI for short) can also take negative
values, which makes this information-theoretic quantity sometimes difficult to interpret.

Let us first look at an example of three variables with positive MMI. To make things a bit
more hands on, let us look at three binary random variables, one telling us whether it is cloudy,
the other whether it is raining and the third one whether it is sunny. We want to compute
$I(\text{rain}; \text{no sun}; \text{cloud})$. In our model, clouds can cause rain and can block the sun and so we have

$$ I(\text{rain}; \text{no sun} \,|\, \text{cloud}) < I(\text{rain}; \text{no sun}), $$

as it is more likely that it is raining and there is no sun visible when it is cloudy than when
there are no clouds visible. This results in positive MMI for $I(\text{rain}; \text{no sun}; \text{cloud})$, a typical
situation for a common-cause structure in the variables: here, the fact that the sun is not shining
can partly be due to the fact that it is raining and partly due to the fact that there are clouds
visible.

In a sense the inverse is the situation where we have two causes with a common effect: This 
situation can lead to negative values for the MMI, see [H7] . In this situation, observing a common 
effect induces a dependency between the causes that did not exist before. This fact is called 
"explaining away" in the context of Bayesian networks, see [51]. Pearl [Mj also gives a car-related 
example where the three (binary) variables are "engine fails to start" {X), "battery dead" {Y) 
and "fuel pump broken" (Z). Clearly, both Y and Z can cause X and are uncorrelated if we 
have no knowledge of the value of X. But fixing the common effect X, namely observing that 
the engine did not start, induces a dependency between Y and Z that can lead to negative values 
of the MMI. 
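
A negative MMI is easily reproduced in a toy model of two causes with a common effect. In the following sketch, which is our own illustrative example and not the car example itself, X and Y are independent fair bits and Z is their exclusive or. The conditional mutual information is computed from entropies via $I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)$.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as a dict outcome -> probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axes):
    """Marginalize a joint pmf over outcome tuples onto the given coordinate axes."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[a] for a in axes)] += p
    return out

# X and Y are independent fair bits (the two causes), Z = X XOR Y (the common effect).
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(*axes):
    return entropy(marginal(joint, axes))

mi_xy = H(0) + H(1) - H(0, 1)                            # I(X;Y)
cmi_xy_given_z = H(0, 2) + H(1, 2) - H(0, 1, 2) - H(2)   # I(X;Y|Z)
mmi = mi_xy - cmi_xy_given_z                             # I(X;Y;Z)
```

Here X and Y share no information on their own, but once Z is known each of them determines the other, so that I(X;Y) = 0 bits, I(X;Y|Z) = 1 bit and the MMI equals -1 bit.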

Another problem with the n-variate case to keep in mind is the combinatorial explosion of
the degrees of freedom regarding the interactions of the variables. As a priori every non-empty
subset of the variables could interact in an information-theoretic sense, this yields $2^n - 1$
degrees of freedom.

4.4 A Distance Measure for Probability Distributions: the Kullback-Leibler Divergence

The Kullback-Leibler divergence [57] (or KL-divergence for short) is a kind of "distance measure"
on the space of probability distributions: Given two probability distributions on the same base
space $\Omega$, interpreted as two points in the space of all probability distributions over the base set
$\Omega$, it tells us how far they are "apart".

We again use the usual expectation-value construction as used for the entropy before.

Definition 4.8 (Kullback-Leibler divergence) Let P and Q be two discrete probability
distributions over the same base space $\Omega$. Then the Kullback-Leibler divergence of P and Q is given
by

$$ D_{\mathrm{KL}}(P\|Q) := \sum_x P(x) \log \frac{P(x)}{Q(x)}. \tag{4.8} $$

The Kullback-Leibler divergence is non-negative, $D_{\mathrm{KL}}(P\|Q) \ge 0$ (and it is zero if P equals
Q almost everywhere), but it is not a metric in the mathematical sense as in general it is
non-symmetric, $D_{\mathrm{KL}}(P\|Q) \ne D_{\mathrm{KL}}(Q\|P)$, and it does not fulfill the triangle inequality. Note that in
their original work, Kullback and Leibler [57] defined the divergence via the sum

$$ D_{\mathrm{KL}}(P\|Q) + D_{\mathrm{KL}}(Q\|P), $$

making it a symmetric measure. $D_{\mathrm{KL}}(P\|Q)$ is additive for independent distributions, namely

$$ D_{\mathrm{KL}}(P\|Q) = D_{\mathrm{KL}}(P_1\|Q_1) + D_{\mathrm{KL}}(P_2\|Q_2), $$

where the two pairs $P_1, P_2$ and $Q_1, Q_2$ are independent probability distributions with the joint
distributions $P = P_1 P_2$ and $Q = Q_1 Q_2$, respectively.



Note that the expression in Equation 4.8 is nothing else than the expected value
$E_P[\log P - \log Q]$ with the expectation value taken with respect to P, which in turn can be interpreted
as "expected distance of P and Q", measured in terms of the information content. Another
interpretation can be given in the language of codes: $D_{\mathrm{KL}}(P\|Q)$ is the average number of extra
bits needed to code samples from P using a code book based on Q.
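
A minimal sketch of the discrete case may help to see the asymmetry at work. The function name `kl_divergence` and the two distributions are hypothetical examples of our own; the function assumes that Q assigns positive probability to every outcome that is possible under P, and it uses base-2 logarithms so that the result can be read as a number of extra bits.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence of P and Q in bits for two pmfs given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions on a base space with three elements.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

d_pq = kl_divergence(p, q)   # extra bits when coding samples of P with a code for Q
d_qp = kl_divergence(q, p)   # extra bits when coding samples of Q with a code for P
```

For these numbers the two directions give about 0.32 bits and about 0.28 bits, respectively, which shows that the order of the arguments matters.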

Analogous to previous examples, the KL-divergence can also be defined for continuous random
variables in a straightforward way via

$$ D_{\mathrm{KL}}(P\|Q) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx, $$

where p and q denote the pdfs of two continuous probability distributions P and Q.



Expanding the logarithm in Equation 4.8 we can write the Kullback-Leibler divergence between
two probability distributions P and Q in terms of entropies as

$$ D_{\mathrm{KL}}(P\|Q) = -E_P(\log q(x)) + E_P(\log p(x)) = H^{\mathrm{cross}}(P,Q) - H(P), $$

where p and q denote the pdf or pmf of the distributions P and Q and $H^{\mathrm{cross}}(P,Q)$ is the
so-called cross-entropy of P and Q given by

$$ H^{\mathrm{cross}}(P,Q) := -E_P(\log Q). $$

This relation lets us easily compute a closed form of the KL-divergence for many common
families of probability distributions. Let us for example look at the value of the KL-divergence
between two normal distributions $P \sim N(\mu_1, \sigma_1^2)$ and $Q \sim N(\mu_2, \sigma_2^2)$, see Figure 14.
This can be calculated as

$$ D_{\mathrm{KL}}(P\|Q) = \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2} + \frac{1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2} - \log\frac{\sigma_1^2}{\sigma_2^2} - 1\right). $$

Another example: The KL-divergence between two exponential distributions $P \sim \mathrm{Exp}(\lambda_1)$
and $Q \sim \mathrm{Exp}(\lambda_2)$ is

$$ D_{\mathrm{KL}}(P\|Q) = \log(\lambda_1) - \log(\lambda_2) + \frac{\lambda_2}{\lambda_1} - 1. $$
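
Such closed forms can be checked against a direct numerical evaluation of the defining integral. The sketch below does this for two normal distributions, using the standard closed-form expression for the KL-divergence between two Gaussians (with natural logarithms, so the result is in nats) and a simple midpoint rule for the integral. The function names, the parameter values and the integration range are our own assumptions.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Logarithm of the density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def kl_normal_closed_form(mu1, s1, mu2, s2):
    """Closed form of the KL-divergence of N(mu1, s1^2) and N(mu2, s2^2) in nats."""
    ratio = s1 ** 2 / s2 ** 2
    return (mu1 - mu2) ** 2 / (2 * s2 ** 2) + 0.5 * (ratio - math.log(ratio) - 1)

def kl_normal_numeric(mu1, s1, mu2, s2, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule approximation of the integral of p(x) log(p(x)/q(x))."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_p = normal_logpdf(x, mu1, s1)
        log_q = normal_logpdf(x, mu2, s2)
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total

# Hypothetical parameters: P = N(0, 1) and Q = N(1, 4).
closed = kl_normal_closed_form(0.0, 1.0, 1.0, 2.0)
numeric = kl_normal_numeric(0.0, 1.0, 1.0, 2.0)
```

For these parameters both values agree at about 0.443 nats. The integrand summed up here is the quantity whose integral yields the KL-divergence.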

Using the Kullback-Leibler divergence we can give yet another characterization of mutual
information: It is a measure of how far two measured variables are from being independent, this
time in terms of the Kullback-Leibler divergence.

$$ \begin{aligned} I(X;Y) &= H(X) - H(X|Y) \\ &= -\sum_x P(x)\log P(x) + \sum_{x,y} P(x,y)\log P(x|y) \\ &= \sum_{x,y} P(x,y)\log\left(\frac{P(x,y)}{P(x)P(y)}\right) \\ &= D_{\mathrm{KL}}\big(P(x,y)\,\|\,P(x)P(y)\big) \end{aligned} \tag{4.9} $$

Thus, mutual information of two random variables can be seen as the KL-divergence of their
underlying joint probability distribution from the product of their marginal probability
distributions, i.e. as a measure for how far the two variables are from being independent.

4.5 Transfer Entropy: Conditional Mutual Information 

In the past, mutual information was often used as a measure of information transfer between 
units (modeled as random variables) in some system. This approach faces the problem that 
mutual information is a symmetric measure and does not have an inherent directionality. In some 
applications this symmetry is not desired though, namely whenever we want to explicitly have
information about the "direction of flow" of information, for example to measure causality in an
information-theoretic setting, see Section 6.5.

In order to make mutual information a directed measure, a variant called time-lagged mutual
information was proposed, calculating mutual information for two variables including a previous
state of the source variable and a next state of the destination variable (where discrete time is
assumed).

Yet, as Schreiber [94] points out, while time-lagged mutual information provides a directed
measure of information transfer, it does not allow for a time-dynamic aspect as it measures the
statically shared information between the two elements. With a suitable conditioning on the past
of the variables, the introduction of a time-dynamic aspect is possible though. The resulting
quantity is commonly referred to as transfer entropy [94]. Its common definition is the following.

Figure 14: The probability densities of two Gaussian probability distributions (a) and the quantity
$P(x) \log (P(x)/Q(x))$ that yields the KL-divergence when integrated (b).

Figure 15: Computing transfer entropy $TE_{Y \to X}$ from source Y to target X at time t as a measure
of the average information present in $y_t$ about the future state $x_{t+1}$. The memory vectors $x_n^k$ and
$y_n^l$ are shown in gray.

Definition 4.9 (transfer entropy) Let X and Y be discrete random variables given on a
discrete time scale and let $k, l \ge 1$ be two natural numbers. Then the transfer entropy from Y to
X with k memory steps in X and l memory steps in Y is defined as

$$ TE_{Y \to X} := \sum_{x_{n+1},\, x_n^k,\, y_n^l} P(x_{n+1}, x_n^k, y_n^l) \log \frac{P(x_{n+1}|x_n^k, y_n^l)}{P(x_{n+1}|x_n^k)}, $$

where we denoted by $x_n, y_n$ the values of X and Y at time n and by $x_n^k$ the past k values of X,
counted from time n on: $x_n^k := (x_n, x_{n-1}, \ldots, x_{n-k+1})$, and analogously
$y_n^l := (y_n, y_{n-1}, \ldots, y_{n-l+1})$.

Although this definition might look complicated at first, the idea behind it is quite simple. It
is merely the Kullback-Leibler divergence between the two conditional probability distributions
$P(x_{n+1}|x_n^k, y_n^l)$ and $P(x_{n+1}|x_n^k)$, averaged over the pasts $x_n^k$ and $y_n^l$,

$$ TE_{Y \to X} = D_{\mathrm{KL}}\big(P(x_{n+1}|x_n^k, y_n^l)\,\|\,P(x_{n+1}|x_n^k)\big), $$

i.e. a measure of how far the two distributions are from fulfilling the generalized Markov
property (see Section 3.8)

$$ P(x_{n+1}|x_n^k) = P(x_{n+1}|x_n^k, y_n^l). \tag{4.10} $$

Note that for small values of transfer entropy, we can say that Y has little influence on X at
time t, whereas we can say that information is transferred from Y to X at time t when the value
is large. Yet, keep in mind that transfer entropy is just a measure of statistical correlation, see
Section 6.5.
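
To see the directionality of transfer entropy at work, consider the following minimal sketch for the simplest case k = l = 1. It replaces all probabilities by relative frequencies counted from the data (a naive estimate that ignores the estimation issues discussed in Section 5). The function name `transfer_entropy` and the toy data, in which the target simply copies the source with a delay of one time step, are our own assumptions.

```python
import math
import random
from collections import Counter

def transfer_entropy(source, target):
    """Frequency-based estimate of the transfer entropy source -> target in bits (k = l = 1)."""
    triples = Counter()                      # counts of (x_{n+1}, x_n, y_n)
    for n in range(len(target) - 1):
        triples[(target[n + 1], target[n], source[n])] += 1
    total = sum(triples.values())
    pairs_next_now = Counter()               # counts of (x_{n+1}, x_n)
    pairs_now_src = Counter()                # counts of (x_n, y_n)
    singles_now = Counter()                  # counts of x_n
    for (x_next, x_now, y_now), c in triples.items():
        pairs_next_now[(x_next, x_now)] += c
        pairs_now_src[(x_now, y_now)] += c
        singles_now[x_now] += c
    te = 0.0
    for (x_next, x_now, y_now), c in triples.items():
        p_joint = c / total
        p_given_both = c / pairs_now_src[(x_now, y_now)]                  # P(x_{n+1} | x_n, y_n)
        p_given_own = pairs_next_now[(x_next, x_now)] / singles_now[x_now]  # P(x_{n+1} | x_n)
        te += p_joint * math.log2(p_given_both / p_given_own)
    return te

random.seed(0)
y = [random.randint(0, 1) for _ in range(20000)]   # source: independent fair bits
x = [0] + y[:-1]                                   # target: x_{n+1} = y_n

te_y_to_x = transfer_entropy(y, x)
te_x_to_y = transfer_entropy(x, y)
```

In this toy system the estimate from Y to X is close to 1 bit, since the source fully determines the next state of the target, whereas the estimate in the opposite direction is close to 0.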

Another interpretation of transfer entropy is seeing it as a conditional mutual information
$I(Y^{(l)}; X' | X^{(k)})$, measuring the average information the source Y contains about the next state
X' of the destination X that was not contained in the destination's past $X^{(k)}$ (see [55]), or
alternatively as the average information provided by the source about the state transition in the
destination, see [5^151] .

As so often before, the concept can be generalized to the continuous case [51], although the 
continuous setting introduces some subtleties that have to be addressed. 

Concerning the memory parameters k and l of the destination and the source, although
arbitrary choices are possible, the values chosen fundamentally influence the nature of the
questions asked. In order to get correct measures for systems that are far from being Markovian
(where Markovian systems are systems whose states are not influenced by more than a certain fixed
number of preceding system states), high values of k have to be used, and for non-Markovian
systems the case $k \to \infty$ has to be considered. On the other hand, commonly just one previous
state of the source variable is considered, setting l = 1 [55], this being also due to the growing
data intensity in k and l and the usually high computational cost of the method.

Note that akin to the case of mutual information, there exist point-wise versions of transfer
entropy (also called local transfer entropy), as well as extensions to the multivariate case, see [62],
and bias-corrected estimators [69], see Section 5.

5 Estimation of Information-theoretic Quantities 

As we have seen in the preceding sections, one needs to know the full sample spaces and probability 
distributions of the random variables involved in order to precisely calculate information-theoretic 
quantities such as the entropy, mutual information or transfer entropy. But obtaining this data 
is in most cases impossible in reality, as the spaces are usually high-dimensional and sparsely 
sampled, rendering the direct methods for the calculation of such quantities impossible to carry 
out. 

A way around this problem is to come up with estimation techniques that estimate entropies 
and derived quantities such as mutual information from the data. Over the last decades a large 
body of research was published concerning the estimation of entropies and related quantities, 
leading to a whole zoo of estimation techniques, each class having its own advantages and 
drawbacks. So rather than a full overview, we will give a sketch of some central ideas here and 
give references to further literature. The reader is also pointed to the review articles [TU1I75] . 

Before looking at estimation techniques for neural (and other) data let us first give a short
review of some important theoretical concepts regarding statistical estimation.

5.1 A Bit of Theory Regarding Estimations 

From a statistical point of view, the process of estimation in its most general form can be regarded 
in the following setting: We have some data (say measurements or data obtained via simulations) 
that is believed to be generated by some stochastic process with an underlying non-autonomous, 
i.e. time-dependent, or autonomous probability distribution. We then want to estimate either 
the value of some function defined on that probability distribution (for example the entropy) or 
the shape of this probability distribution as a whole (from which we can then obtain an estimate 
of a derived quantity). This process is called estimation and a function mapping the data to an
estimated quantity is called an estimator. In this section we will first look at estimators and their desired
properties and then look at what is called maximum likelihood estimation, the most commonly 
used method for the estimation of parameters in the field of statistics. 

5.1.1 Estimators 

Let $x = (x_1, \ldots, x_n)$ be a set of realizations of the random variable X that is believed to have a
probability distribution that comes from a family of probability distributions $P_\theta$ parametrized by
a parameter $\theta$ and assume that the underlying probability distribution of X is $P_{\theta_{\mathrm{true}}}$.

Let $T: x \mapsto \hat{\theta}_{\mathrm{true}}$ be an estimator for the parameter $\theta$ with the true value
$\theta_{\mathrm{true}}$. For the value of the estimated parameter we usually write
$\hat{\theta}_{\mathrm{true}} := T(x)$. The bias of $T(x)$ is the expected
difference between $\hat{\theta}_{\mathrm{true}}$ and $\theta_{\mathrm{true}}$:

$$ \mathrm{bias}(T) := E_X[\hat{\theta}_{\mathrm{true}} - \theta_{\mathrm{true}}], $$

and an estimator with vanishing bias is called unbiased.

We usually also want the estimator to be consistent, i.e. we want the estimated value
$\hat{\theta}_{\mathrm{true}}$ to converge to the value of the true parameter $\theta_{\mathrm{true}}$ in probability as
the sample x increases in size, i.e. as $n \to \infty$:

$$ \lim_{n \to \infty} P(|T(X) - \theta_{\mathrm{true}}| > \epsilon) = 0 \quad \text{for every } \epsilon > 0. $$

Another important property of an estimator is its variance var(T) and an unbiased estimator
having the minimal variance among all unbiased estimators of the same parameter is called
efficient.

Yet another measure often used when assessing the quality of an estimator T is its mean
squared error

$$ \mathrm{MSE}(T) = (\mathrm{bias}(T))^2 + \mathrm{var}(T), $$

and as we can see, any unbiased estimator with minimal variance minimizes the mean squared
error among all unbiased estimators.

Without further going into detail here, it is noted that there exists a theoretical lower bound to 
the minimal variance obtainable by an unbiased estimator, the Cramer-Rao bound. The Cramer- 
Rao bound sets the variance of the estimator in relation with the so called Fisher information 
(that can be set into relation with mutual information, see pTT llllSj ). The interested reader is 
pointed to [5511^. 

5.1.2 Estimating Parameters: The Maximum Likelihood Estimator

Maximum likelihood estimation is the most widely used estimation technique in statistics and, as
we will see in the next few paragraphs, a straightforward procedure that in essence tells us what
the most likely parameter value in an assumed family of probability distributions is, given a set
of realizations of a random variable that is believed to have an underlying probability distribution
from the family considered.

In statistical applications one often faces the following situation: We have a finite set of
realizations $\{x_i\}_i$ of a random variable X. We assume X to have a probability distribution
$f(x, \theta_{\mathrm{true}})$ in a certain parametrized class of probability distributions $\{f(x, \theta)\}_\theta$,
where the true parameter $\theta_{\mathrm{true}}$ is unknown. The goal is to get an estimate
$\hat{\theta}_{\mathrm{true}}$ of $\theta_{\mathrm{true}}$ using the realizations
$\{x_i\}_i$, i.e. to do statistical inference of the parameter $\theta$. Let us consider the so-called likelihood
function

$$ L(\theta|x) = P_\theta(X = x) = f(x|\theta) $$

as a function of $\theta$. It is a measure of how likely it is that the parameter of the probability
distribution has the value $\theta$, given the observed realization x of X. In maximum likelihood
estimation, we look for the parameter that maximizes the likelihood function. This is
$\hat{\theta}_{\mathrm{true}}$:

$$ \hat{\theta}_{\mathrm{true}} = \arg\max_\theta L(\theta|x). $$

Choosing $\theta = \hat{\theta}_{\mathrm{true}}$ minimizes, among all possible values of $\theta$, the KL-divergence
between the empirical distribution of the data and $P_\theta$, and thus, for large samples, approximately
the KL-divergence between $P_{\theta_{\mathrm{true}}}$ and $P_\theta$. The value $\hat{\theta}_{\mathrm{true}}$,
often written as $\hat{\theta}_{\mathrm{MLE}}$, is called the maximum likelihood
estimate (MLE for short) of $\theta_{\mathrm{true}}$.
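
As a small worked example, consider estimating the success probability of a coin from a handful of flips. The data, the grid of candidate parameter values and all names in the following sketch are hypothetical choices of our own; the maximum is searched numerically on the logarithm of the likelihood and compared with the known closed-form solution for this model, the relative frequency of successes.

```python
import math

data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # hypothetical coin flips (1 = success)

def log_likelihood(theta, xs):
    """Logarithm of the likelihood of independent Bernoulli(theta) observations."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in xs)

# Candidate parameter values, excluding the boundary values 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))

# Closed-form maximum likelihood estimate for the Bernoulli family.
theta_closed_form = sum(data) / len(data)
```

Both approaches yield the value 0.3 for this data set. For more complicated families a closed form is usually not available and one has to rely on numerical optimization.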

In this setting, one often does not use the likelihood function directly, but works with the log of
the likelihood function (this is referred to as log-likelihood). Why? The likelihood functions are
often very complicated and situated in high dimensions, making it impossible to find a maximum 
of the function analytically. Thus, numerical methods (such as Newton's method and variants or 
the simplex method) have to be employed in order to find a solution. These numerical methods 
work best (and can be shown to converge to a unique solution) if the function they operate on is 
concave (bowl-shaped, where the closed end is on the top). Taking the logarithm makes the
likelihood function concave in many cases, which is the reason why one considers the
log-likelihood function, rather than the likelihood function directly, see also (SUj.

5.2 Regularization

Having looked at some core theoretical concepts regarding the estimation of quantities depending 
on probability distributions let us now come back to dealing with real data. 

As in real-world data the involved probability distributions are often continuous and
infinite-dimensional, the resulting estimation problem is very difficult (if not impossible) to solve in its
original setting. As a remedy, the problem is often regularized, i.e. mapped to a discrete, more
easily solvable problem. This of course introduces errors and often makes a direct estimation of
the information-theoretic quantities impossible, but even in that simplified model we can estimate
lower bounds of the quantities that we are interested in.

The mathematical foundation of this is Shannon's information processing inequality [26]

$$ I(X;Y) \ge I(S(X);T(Y)), $$

where X and Y are (discrete) random variables and S and T are measurable maps.

By choosing the mappings S and T as our regularization mappings (you might also regard 
them as parameters) we can change the coarseness of the regularization. The regularization 
can be chosen arbitrarily coarse, i.e. choosing S and T as constant functions, but this of course 
comes with a price. For example in the latter case of constant S and T the mutual information
$I(S(X);T(Y))$ would be equal to 0, clearly not a very useful estimate. This means that a trade-off
between complexity reduction and the quality of the estimation has to be made. In general, there 
exists no all-purpose recipe for this, each problem requiring an appropriate regularization. 
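
The effect of the regularization mappings can be illustrated with a small discrete example. In the following sketch, which uses a hypothetical joint distribution of two variables with four values each, the mappings S and T merge neighbouring values into coarser bins; the helper names `mutual_information` and `coarsen` are our own.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint pmf given as a nested list."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (p_x[i] * p_y[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def coarsen(joint, x_groups, y_groups):
    """Merge rows and columns of the joint pmf according to the given groups of indices."""
    return [[sum(joint[i][j] for i in xg for j in yg) for yg in y_groups]
            for xg in x_groups]

# Hypothetical joint distribution of two variables with four values each.
fine = [[0.20, 0.05, 0.00, 0.00],
        [0.05, 0.20, 0.00, 0.00],
        [0.00, 0.00, 0.15, 0.10],
        [0.00, 0.00, 0.10, 0.15]]

halves = [[0, 1], [2, 3]]        # S and T merge values pairwise
everything = [[0, 1, 2, 3]]      # S and T are constant functions

mi_fine = mutual_information(fine)
mi_coarse = mutual_information(coarsen(fine, halves, halves))
mi_constant = mutual_information(coarsen(fine, everything, everything))
```

The mutual information decreases from about 1.15 bits for the original variables to 1 bit for the pairwise merged ones and to 0 bits for the constant mappings, in line with the inequality: each coarser description yields a lower bound on the original quantity.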

As this discretization technique has become the standard method in the neurosciences
(and also in other fields), we will solely consider the regularized, discrete case in the following and 
point the reader to the review article (TU] concerning the continuous case. 

In the neurosciences, such a regularization technique was also proposed and is known as the 
"direct method" |1041 [TO] . Here, spike trains of recorded neurons are discretized into time bins of 
a given fixed width and the neuronal spiking activity is interpreted as a series of symbols from an 
alphabet defined via the observed spiking pattern in the time bins. 
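
The discretization step can be sketched as follows. This is a simplified illustration under our own assumptions and not an implementation of the complete method: the spike times, the bin width of 10 ms and the word length of two bins are hypothetical, a bin is marked with 1 if at least one spike falls into it, and the entropy of the resulting words is computed from their relative frequencies.

```python
import math
from collections import Counter

def bin_spike_train(spike_times, duration, bin_width):
    """Return a list of 0/1 symbols: 1 if at least one spike falls into the time bin."""
    n_bins = int(round(duration / bin_width))
    binary = [0] * n_bins
    for t in spike_times:
        index = min(int(t / bin_width), n_bins - 1)
        binary[index] = 1
    return binary

def word_entropy(binary, word_length):
    """Entropy in bits of the relative frequencies of non-overlapping words."""
    words = [tuple(binary[i:i + word_length])
             for i in range(0, len(binary) - word_length + 1, word_length)]
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical spike times in seconds within a recording of 0.2 s.
spikes = [0.012, 0.031, 0.055, 0.058, 0.093, 0.121, 0.152, 0.177]
binary = bin_spike_train(spikes, duration=0.2, bin_width=0.01)
h_words = word_entropy(binary, word_length=2)
```

Both the bin width and the word length act as regularization parameters here: wider bins and shorter words give a coarser alphabet that is easier to sample, at the price of discarding temporal detail.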

5.3 Non-parametric Estimation Techniques 

Commonly, two different classes of estimation techniques regarding the shape of probability
distributions are distinguished. Parametric estimation techniques assume that the probability
distribution is contained in some family of probability distributions having some prescribed shape
(see Section 3.7). Here, one estimates the value of the parameter from the data observed, whereas
non-parametric estimation techniques make no assumptions about the shape of the underlying
distribution. We will solely look at non-parametric estimation techniques in the following as in
many cases one tries to not assume prior information about the shape of the distribution.

Histogram-based estimation is the most popular and most-widely used estimation technique. 
As the name implies, this method uses a histogram obtained from the data to estimate the 
probability distribution of the underlying random generation mechanism. 

For the following, assume that we obtained a finite set of $N$ samples $x = \{x_j\}_{j=1}^{N}$ of some real
random variable $X$ defined over some probability space $(\Omega, \Sigma, P)$. We then divide the domain of
$X$ into $m \in \mathbb{N}$ equally sized bins $\{b_i\}_{i=1}^{m}$ and subsequently count the number of realizations $x_j$ in our
data set contained in each bin. Here, the number $m$ of bins can be freely chosen. It controls
the coarseness of our discretization, where the limit $m \to \infty$ is the continuous case. This allows
us to define relative frequencies of occurrences for $X$ with respect to each bin that we interpret
as estimates $\hat{p}_i^m$ (note that we make the dependence on the number of bins $m$ explicit in the
notation) of the probability of $X$ taking a value in bin $b_i$, which we denote by $p_i^m = P(X \in b_i)$.
Figure 16: Estimation bias for a non-bias-corrected histogram-based maximum likelihood estimator
$\hat{H}_{\mathrm{est}}$ of the entropy of a given distribution with true entropy $H = 8$ bits. Estimated values are
shown for three different sample sizes $N$. Adapted from [73], Figure 1.

The law of large numbers then tells us that our estimated probability values converge to the real
probabilities as $N \to \infty$.

Note that although histogram-based estimations are usually called non-parametric as they do 
not assume a certain shape of the underlying probability distribution, they do have parameters, 
namely one parameter for each bin, the estimated probability value $\hat{p}_i^m$. These estimates $\hat{p}_i^m$ can
also be interpreted as maximum-likelihood estimates of $p_i^m$.

The following defines an estimator of the entropy based on the histogram. It is often called 
"plug in" estimator: 



$$\hat{H}_{\mathrm{MLE}}(x) := -\sum_{i=1}^{m} \hat{p}_i^m \log \hat{p}_i^m \qquad (5.1)$$
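As a minimal sketch of the histogram-based plug-in estimator, the following Python snippet bins a sample into $m$ equally sized bins and evaluates the entropy of the relative frequencies. The function names, the use of NumPy, the choice of base-2 logarithms (bits) and the uniform test data are assumptions made for illustration only; in particular, the bins here span the observed sample range rather than the full domain of $X$.

```python
import numpy as np

def histogram_probabilities(samples, m):
    """Relative frequencies of the samples in m equally sized bins.

    Assumption: the bins cover the observed range [min(samples), max(samples)].
    """
    counts, _ = np.histogram(samples, bins=m)
    return counts / counts.sum()

def entropy_mle(probabilities):
    """Plug-in entropy estimate in bits; empty bins contribute 0 by convention."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Illustrative data: 50 samples of a uniform random variable, 16 bins.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=50)
h_est = entropy_mle(histogram_probabilities(samples, m=16))
h_true = np.log2(16)  # entropy of the discretized uniform distribution
print(f"estimate: {h_est:.3f} bits, true value: {h_true:.3f} bits")
```

With so few samples per bin the estimate typically falls below the true value, which is the kind of bias discussed in this section.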



There are some problems with this estimator $\hat{H}_{\mathrm{MLE}}(x)$, though. Its convergence to the true
value $H(X)$ can be slow and it is negatively biased, i.e. its value is almost always below the true
value $H(X)$, see [83l IH [79l [82] . This shift can be quite significant even for large $N$, see Figure 16
and [75]. More specifically, one can show that the expected value of the estimated entropy is
always smaller than the true value of the entropy,

$$\mathbb{E}_P\left[\hat{H}_{\mathrm{MLE}}(x)\right] < H(X),$$

where the expectation value is taken with respect to the true probability distribution $P$.

Bias generally is a problem for histogram-based estimation techniques [TH [SSI [H2] and although
we can correct for the bias, this may not always be a feasible solution [1]. Nonetheless, we will
have a look at a bias-corrected version of the estimator given in Equation (5.1) below.

As a remedy to the bias problem, Miller and Madow [71] calculated the bias of the estimator
of Equation (5.1) and came up with a bias-corrected version of the maximum likelihood estimator
for the entropy, referred to as the Miller-Madow estimator:



$$\hat{H}_{\mathrm{MM}}(x) := \hat{H}_{\mathrm{MLE}}(x) + \frac{\hat{m} - 1}{2N},$$

where $\hat{m}$ is an estimate of the number of bins with non-zero probability. We will not go into
the details of the method here; the interested reader is referred to [71].
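The effect of the Miller-Madow correction $(\hat{m} - 1)/(2N)$ can be illustrated with a small simulation. The sketch below is not taken from [71]; it assumes entropies measured in nats (natural logarithm), uses the number of non-empty bins as $\hat{m}$, and draws counts from a uniform distribution over 32 bins, all of which are illustrative choices.

```python
import numpy as np

def entropy_mle_nats(counts):
    """Plug-in entropy (in nats) from a vector of bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def entropy_miller_madow_nats(counts):
    """Plug-in entropy plus the correction (m_hat - 1) / (2 N)."""
    counts = np.asarray(counts, dtype=float)
    m_hat = np.count_nonzero(counts)  # assumption: non-empty bins estimate m
    n = counts.sum()
    return entropy_mle_nats(counts) + (m_hat - 1) / (2.0 * n)

# Simulation with assumed parameters: m bins, N samples per data set.
rng = np.random.default_rng(1)
m, n_samples, n_trials = 32, 100, 2000
true_entropy = np.log(m)

mle = np.empty(n_trials)
mm = np.empty(n_trials)
for t in range(n_trials):
    counts = rng.multinomial(n_samples, np.full(m, 1.0 / m))
    mle[t] = entropy_mle_nats(counts)
    mm[t] = entropy_miller_madow_nats(counts)

bias_mle = mle.mean() - true_entropy
bias_mm = mm.mean() - true_entropy
print(f"bias of plug-in estimator:      {bias_mle:+.3f} nats")
print(f"bias of Miller-Madow estimator: {bias_mm:+.3f} nats")
```

If entropies are measured in bits instead, the correction term has to be divided by $\ln 2$.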
Figure 17: An information-theoretic view on neural systems. Neurons can either act as channels
in the information-theoretic sense, relaying information about some stimulus, or as senders and
receivers with channels being synapses.

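Another way of bias-correcting $\hat{H}_{\mathrm{MLE}}(x)$ is the so-called "jackknifed" version of the
maximum-likelihood estimator by Efron and Stein [51]:

$$\hat{H}_{\mathrm{JK}}(x) := N \hat{H}_{\mathrm{MLE}}(x) - \frac{N-1}{N} \sum_{j=1}^{N} \hat{H}_{\mathrm{MLE}}(x \setminus \{x_j\}),$$

where $x \setminus \{x_j\}$ denotes the data set with the $j$-th sample left out.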
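A jackknifed entropy estimate of the common form $N \hat{H}_{\mathrm{MLE}} - \frac{N-1}{N} \sum_j \hat{H}_{\mathrm{MLE}}^{(-j)}$, where $\hat{H}_{\mathrm{MLE}}^{(-j)}$ is the plug-in estimate with sample $j$ left out, can be sketched as follows. The code is an illustration under stated assumptions: it works on bin counts, uses natural logarithms, and exploits the fact that leaving out any of the $n_i$ samples in bin $i$ gives the same reduced histogram.

```python
import numpy as np

def entropy_mle(counts):
    """Plug-in entropy (in nats) from a vector of bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def entropy_jackknife(counts):
    """Jackknifed plug-in entropy (in nats) computed from bin counts."""
    counts = np.asarray(counts, dtype=int)
    n = int(counts.sum())
    leave_one_out_sum = 0.0
    for i, c in enumerate(counts):
        if c == 0:
            continue
        reduced = counts.copy()
        reduced[i] -= 1  # histogram with one sample of bin i removed
        # c samples fall into bin i, so this term occurs c times in the sum.
        leave_one_out_sum += c * entropy_mle(reduced)
    return n * entropy_mle(counts) - (n - 1) / n * leave_one_out_sum

counts = [5, 3, 2]  # made-up histogram with N = 10 samples
print(entropy_mle(counts), entropy_jackknife(counts))
```

For this small histogram the jackknifed value lies above the plug-in value, in line with the negative bias of the latter.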



Yet another bias-corrected variant of the MLE estimator based on polynomial approximation
is presented in [79], for which also bounds on the maximal estimation error were derived.

In an effort to overcome the problems faced by histogram-based estimation, many new and 
more powerful estimation techniques have emerged over the last years, both for entropy and other 
information-theoretic quantities. As our focus here is to give an introduction to the field, we 
will not review all of those methods here but rather point the interested reader to the literature, 
where a variety of approaches is discussed. There exist methods based on the idea of adaptive 
partitioning of sample space [21 , ones using entropy production rates and allowing for confidence 
intervals [99], ones using Bayesian methods [75, 99, 90] and ones based on density estimation
using nearest neighbors [55], along with many more. See [46] for an overview concerning several
estimation techniques for entropy and mutual information. We note here that in contrast to 
estimations of entropy, estimators of mutual information are usually positively biased, i.e. tend to 
overestimate mutual information. 

6 Information-theoretic Analyses of Neural Systems

Some time after its discovery by Shannon, neuroscientists started to recognize information theory 
as a valuable mathematical tool to assess information processing in neural systems. Using 
information theory, several questions regarding information processing and the neural code can 
be addressed in a quantitative way, among those 

• how much information single cells or populations carry about a stimulus and how this 
information is coded, 

• what aspects of a stimulus are encoded in the neural system and 

• how "effective connectivity" [40] in neural systems can be defined via causal relationships 
between units in the system. 



See Figure 17 for an illustration of how Shannon's theory can be used in a neural setting. 
Attneave [6] and Barlow [9] were the first to consider information processing in neural systems
from an information-theoretic point of view. Subsequently, Eckhorn and Pöpel [32, 33] applied
information-theoretic methods to electrophysiologically recorded data of neurons in a cat. But
being data-intensive in nature, these methods faced some quite strong restrictions at that
time, namely the limited amount of computing power (and computer memory) and the limited
amount (and often low quality) of data obtainable via measurements.

But over the last decades, computing power became more and more available and old
measurement techniques were improved, along with new ones emerging such as fMRI, MEG and
calcium imaging. This made information-theoretic analyses of neural systems more and more
feasible, and with recording techniques such as MEG and fMRI it is nowadays
even possible to perform such analyses on a system scale for the human brain in vivo. Yet, even
with the newly available recording techniques there are some conceptual difficulties with
information-theoretic analyses, as it is often a challenge to obtain enough data in order to get
good estimates of information-theoretic quantities. Special attention has to be paid to using the
data efficiently, and the validity of such analyses has to be assessed with respect to their statistical significance.

In the following we will discuss some conceptual questions relevant when regarding information 
theoretic analyses of neural systems. More detailed reviews can be found in [T^ 11111 [551 155] . 

6.1 The Question of Coding 

Marr described "three levels at which any machine carrying out an information-processing task 
must be understood" [68, Chapter 1.2]. They are:

1. Computational theory: What is the goal of the computation, why is it appropriate, and 
what is the logic of the strategy by which it can be carried out? 

2. Representation and algorithm: How can this computational theory be implemented? In 
particular, what is the representation for the input and output, and what is the algorithm 
for the transformation? 

3. Hardware implementation: How can the representation and algorithm be realized physically? 

Particularly, when performing an information-theoretic analysis of a system one faces the 
fundamental problem related to the coding of the information: In order to calculate (i.e. estimate) 
information theoretic quantities, one has to define a family of probability distributions over the 
state space of the system, each member of that family describing one system state that is to 
be considered. As we know, all information-theoretic quantities such as entropy and mutual 
information (between the system state and the state of some external quantity) are determined 
by the probability distributions involved. The big question now is how to define the system state 
in the first place, a question which is especially difficult to answer in the case of neural systems
on all scales. 

One possible way to build such a probabilistic model for a sensory neurophysiological experiment
involving just one neuron is the following. Typically, the experiment consists of many trials,
where per trial $i = 1, \ldots, n$ in some defined time window a stimulus $S_i$ is presented, eliciting a
neural response $R(S_i)$ consisting of a sequence of action potentials. Presenting the same stimulus
$S$ many times allows for the definition of a probability distribution of responses $R(S)$ of the
neuron to a stimulus $S$. This is modeled as a conditional probability distribution $P_{R|S}$. As noted
earlier, we usually have no direct access to $P_{R|S}$ but rather have to find an estimate $\hat{P}_{R|S}$ from
the available data. Note that in practice, usually the joint probability distribution $P(R, S)$ is
estimated and estimates of conditional probability distributions are subsequently obtained from
the estimate of the joint distribution.
Let us now assume that the stimuli are drawn from the set of stimuli $S = \{S_1, \ldots, S_k\}$ according
to some probability distribution $P_S$ (that can be freely chosen by the experimenter). We can then
compute the mutual information between the stimulus ensemble $S$ and its elicited response $R(S)$

$$I(S; R(S)) := H(S) - H(S \mid R(S)) = H(R(S)) - H(R(S) \mid S)$$

using the probability distributions $P_S$ and $P_{R|S}$, see Section 4.3.

As usual, by mutual information we assess the expected shared information between the 
stimulus and its elicited response averaged over all stimuli and responses. In order to break 
this down to the level of single stimuli we can either consider the point-wise mutual information 
or employ one of the proposed decompositions of mutual information such as stimulus-specific 
information or stimulus-specific surprise, see [20] for a review.
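The stimulus-response setting above can be sketched in a few lines of Python. The snippet estimates the joint distribution of stimulus and response from trial data by counting and then evaluates the mutual information via the identity $I(S;R) = H(S) + H(R) - H(S,R)$, which is equivalent to $H(S) - H(S \mid R)$. The function names, the integer coding of stimuli and responses, and the toy trials are assumptions for illustration; no bias correction is applied.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a (possibly multi-dimensional) probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def joint_from_trials(stimuli, responses, n_stimuli, n_responses):
    """Estimate the joint distribution P(S, R) by counting trial outcomes."""
    counts = np.zeros((n_stimuli, n_responses))
    np.add.at(counts, (np.asarray(stimuli), np.asarray(responses)), 1)
    return counts / counts.sum()

def mutual_information_bits(joint):
    """I(S; R) = H(S) + H(R) - H(S, R) for a joint distribution (rows: S)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_s = joint.sum(axis=1)  # marginal distribution of the stimulus
    p_r = joint.sum(axis=0)  # marginal distribution of the response
    return entropy_bits(p_s) + entropy_bits(p_r) - entropy_bits(joint)

# Toy experiment: two stimuli, response identifies the stimulus perfectly.
stimuli = [0, 0, 1, 1, 0, 1, 0, 1]
responses = [0, 0, 1, 1, 0, 1, 0, 1]
joint = joint_from_trials(stimuli, responses, n_stimuli=2, n_responses=2)
print(mutual_information_bits(joint))
```

In this toy case the response determines the stimulus, so the mutual information equals the stimulus entropy of one bit.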

Having sketched the general setting let us come back to the question of coding of information 
by the neurons involved. This is important as we have to adjust our model of the neural responses 
accordingly, the goal being to capture all relevant features of the neural response in the model. 

Regarding neural coding, there are two main hypotheses of how single neurons might code 
information: Neurons could use a rate code, i.e. encode the information via their mean firing rates 
neglecting the timing patterns of spikes or they could employ a temporal code, i.e. a code where 
the precise timing of single spikes plays an important role. Yet another hypothesis would be that 
neurons code information in bursts of spikes, i.e. groups of spikes emitted in a small time window, 
which is a variant of the time code. For questions regarding coding in populations see the review 

m- 

Note that the question of neural coding is a highly debated one in the neurosciences as of 
today (see [96, 42]) and we do not want to favor one viewpoint over the other in the following.
As with many things in nature, there does not seem to be a clear black-and-white picture regarding
neuronal coding. Rather it seems that a gradient of different coding schemes is employed 
depending on which sensory system is considered and at which stage of neuronal processing, see 

gam mi Hang. 

6.2 Computing Entropies of Spike Trains 

Let us now compute the entropy of spike trains and subsequently single spikes, assuming that 
the neurons we model either employ a rate or a time code. We are especially interested in the 
maximal entropy attainable by our model spike trains as these can give us upper bounds for 
the amount of information such trains and even single spikes can carry in theory. The following 
examples here are adapted from [108] . Concerning the topics of spike trains and their analysis, 
the interested reader is also pointed to [92].

First, we define a model for the spike train emitted by a neuron measured for some fixed 
time interval of length T. We can consider two different models for the spike train, a continuous 
and a discrete one. In the continuous case, we model each spike by a Dirac delta function and 
the whole spike train as a combination of such functions. The discrete model is obtained from 
the continuous one by introducing small time bins of size $\Delta t$ in a way that one bin can at most
contain one spike, say $\Delta t = 2$ ms. We then assign to each bin in which no spike occurred a value
of 0 and to each bin in which a spike occurred a value of 1, see Figure 18.

Let us use this discrete model for the spike train of a neuron, representing a spike train as a
binary string $S$ in the following. Fixing the time span to be $T$ and the bin width to be $\Delta t$, each
spike train $S$ has length $N = T/\Delta t$. We want to calculate the maximal entropy among all such
spike trains $S$, subject to the condition that the number of spikes in $S$ is a fixed number $r < N$,
which we call the spike rate of $S$.
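The discretization described above can be sketched as follows. The helper `bin_spike_train` and the spike times are hypothetical; the spike times were chosen so that, with a bin width of 2 ms and a window of 38 ms, the result matches the binary string shown in Figure 18.

```python
import numpy as np

def bin_spike_train(spike_times, duration, bin_width):
    """Convert spike times into a binary string of length N = duration / bin_width.

    A bin is 1 if it contains a spike and 0 otherwise. The bin width must be
    small enough that no bin contains more than one spike.
    """
    n_bins = int(round(duration / bin_width))
    indices = np.floor(np.asarray(spike_times) / bin_width).astype(int)
    if np.any(indices < 0) or np.any(indices >= n_bins):
        raise ValueError("spike time outside the interval [0, duration)")
    if len(np.unique(indices)) != len(indices):
        raise ValueError("bin width too large: a bin contains several spikes")
    binary = np.zeros(n_bins, dtype=int)
    binary[indices] = 1
    return binary

# Made-up spike times in milliseconds.
spike_times_ms = [3.0, 7.5, 16.2, 24.0, 30.9]
train = bin_spike_train(spike_times_ms, duration=38.0, bin_width=2.0)
print("".join(map(str, train)))  # N = 19 bins, r = 5 spikes
```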



[Figure 18 panels: spikes on a time axis divided into bins of width $\Delta t$; resulting binary string 0101000010001001000]

Figure 18: Model of a spike train. The binary string is obtained through a binning of time.



Let us now calculate the entropy in the firing pattern of a neuron of which we assume that 
spike timing carries important information, i.e. a neuron employing a time code. In order to keep 
the model simple, let us further assume that the spiking behavior is not restricted in any way, i.e. 
that all possible binary strings S are equiprobable. Then we can calculate the entropy of this 
uniform probability distribution P as 



$$H(P) = \log_2 \binom{N}{r} \qquad (6.1)$$

where $\binom{N}{r} = \frac{N!}{r!\,(N-r)!}$ denotes the binomial coefficient, the number of all distinct binary
strings of length $N$ having exactly $r$ non-zero entries. The entropy in Equation (6.1) can be written
approximately as

$$H(P) \approx -\frac{N}{\ln 2}\left[\frac{r}{N}\ln\frac{r}{N} + \left(1-\frac{r}{N}\right)\ln\left(1-\frac{r}{N}\right)\right] \qquad (6.2)$$

where $\ln$ denotes the natural logarithm to the base $e$. The expression in Equation (6.2) is
obtained by using the approximation formula

$$\log \binom{n}{k} \approx n\left(-\frac{k}{n}\log\frac{k}{n} - \left(1-\frac{k}{n}\right)\log\left(1-\frac{k}{n}\right)\right)$$

which is valid for large $n$ and $k$ and in turn based on Stirling's approximation formula for
$\ln n!$.

See Figure 19A for the maximum entropy attainable by the time code as a function of bin
size $\Delta t$ for different firing rates $r$.
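To get a feeling for the numbers, the following sketch compares the exact entropy $\log_2 \binom{N}{r}$ of the time code with the Stirling-based approximation $-\frac{N}{\ln 2}\left[q \ln q + (1-q)\ln(1-q)\right]$, $q = r/N$, for a few bin widths. The window length of 1000 ms and the spike count of 50 are assumed values chosen only for illustration.

```python
import math

def time_code_entropy_bits(n_bins, n_spikes):
    """Exact entropy log2 C(N, r) of equiprobable strings with r spikes."""
    return math.log2(math.comb(n_bins, n_spikes))

def time_code_entropy_approx_bits(n_bins, n_spikes):
    """Stirling-based approximation; requires 0 < r < N."""
    q = n_spikes / n_bins
    return -n_bins / math.log(2) * (q * math.log(q) + (1 - q) * math.log(1 - q))

window_ms = 1000.0  # assumed observation window T
n_spikes = 50       # assumed number of spikes r in the window

for bin_width_ms in (1.0, 2.0, 5.0):
    n_bins = int(window_ms / bin_width_ms)
    exact = time_code_entropy_bits(n_bins, n_spikes)
    approx = time_code_entropy_approx_bits(n_bins, n_spikes)
    print(f"dt = {bin_width_ms} ms: exact {exact:.1f} bits, "
          f"approx {approx:.1f} bits, {exact / n_spikes:.2f} bits per spike")
```

Smaller bins give more possible spike patterns and hence a larger maximum entropy per spike, which is the trend shown for the time code in Figure 19.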

On the other hand, modeling a neuron that reacts to different stimuli with a graded response
in its firing rate is usually done using a rate code. Assuming a rate code where the timing of spikes
does not play any role yields different results, as we will see in the following, see Figure 19B. In
the rate code only the number of spikes $N$ occurring in a given time interval of length $T$ matters,
i.e. we consider probability distributions $P_{N,T}$ parametrized by $N$ and $T$ describing how likely
the occurrence of $N$ spikes in a time window of length $T$ is. Being well-backed with experimental
data [1011 Y2M. [71] , a popular choice of $P_{N,T}$ is taking a Poisson distribution with some fixed mean
$N = r \cdot T$, where $r$ is thought of as the mean firing rate of the neuron.

The probability $P_{N,T}(n)$ of observing $n$ spikes in an interval of length $T$ now is given by the
pmf of the Poisson distribution

$$P_{N,T}(n) = \frac{N^n e^{-N}}{n!}$$

and the entropy of $P_{N,T}$ computes as
Figure 19: Maximum entropy per spike for spike trains. (A) Time code with different rates $r$ as a
function of the size $\Delta t$ of the time bins. (B) Rate code using Poisson and exponential spiking
statistics. Figure adapted from [108], Fig. D.4.



$$H(P_{N,T}) = -\sum_{n} P_{N,T}(n) \log P_{N,T}(n).$$

Again using Stirling's formula, this can be approximated for large $N$ as

$$H(P_{N,T}) \approx \frac{1}{2}\left(\log N + \log 2\pi e\right). \qquad (6.3)$$



Dividing the entropy $H(P_{N,T})$ by the number of spikes that occurred yields the entropy per
spike. See Figure 19B for a plot of the entropy per spike as a function of the number of observed
spikes.
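The entropy of the Poisson spike count distribution can also be evaluated numerically by summing over the pmf. The sketch below does this and compares the result with the standard large-mean (Gaussian) approximation $\frac{1}{2}\log_2(2\pi e N)$; the truncation of the infinite sum and the mean spike counts used are assumptions made for illustration.

```python
import math

def poisson_pmf(n, mean):
    """Poisson pmf N^n e^{-N} / n!, evaluated in log space for stability."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def poisson_entropy_bits(mean):
    """Entropy in bits, summing the pmf up to a generous cutoff."""
    # Assumption: terms beyond mean + 12 standard deviations + 20 are negligible.
    n_max = int(mean + 12.0 * math.sqrt(mean) + 20)
    entropy = 0.0
    for n in range(n_max + 1):
        p = poisson_pmf(n, mean)
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy

def poisson_entropy_approx_bits(mean):
    """Large-mean approximation 0.5 * log2(2 * pi * e * N)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * mean)

for mean_count in (10.0, 50.0, 200.0):
    h = poisson_entropy_bits(mean_count)
    h_approx = poisson_entropy_approx_bits(mean_count)
    print(f"N = {mean_count:5.0f}: H = {h:.3f} bits (approx {h_approx:.3f}), "
          f"{h / mean_count:.3f} bits per spike")
```

The entropy grows only logarithmically with the mean spike count, so the entropy per spike decreases as more spikes are observed.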

An interesting question is to ask for the maximal information (i.e. entropy) that spike trains
can carry, assuming a rate code. Assuming continuous time and prescribing mean and variance of
the firing rate, this leaves the exponential distribution $P_{\exp}$ as the one with the maximal entropy.
The entropy of an exponentially distributed spike train $P_{\exp}(n) \propto e^{-\lambda n}$ with mean rate
$r = 1/(T(e^{\lambda} - 1))$ is

$$H(P_{\exp}) = \log(1 + N) + N \log\left(1 + \frac{1}{N}\right),$$

see also Figure 19B.

Note that it was possible to compute the exact entropies in the preceding only because we assumed
full knowledge of the underlying probability distributions. This is of course not the case for
data obtained by recordings. Here the estimation of entropies faces the bias-related problems of
sparsely sampled probability distributions as discussed earlier. Concerning entropy estimation in 
spike trains the reader is also pointed to [5^ . 

6.3 Efficient Coding? 



The principle of efficient coding [6, 9, 100] (also called Infomax principle) was first proposed
by Attneave and Barlow. It views the early sensory pathway as a channel in Shannon's sense
and postulates that early sensory systems try to maximize information transmission under the
constraint of an efficient code, i.e. that neurons maximize mutual information between a stimulus
and their output spike train, using as few spikes as possible. This minimization of spikes for a
given stimulus results in a maximal compression of the stimulus data, minimizing redundancies
between different neurons on a population level. One key prediction of this optimality principle
is that neurons involved in the processing of stimulus data (and ultimately the whole brain) are
adapted to natural stimuli, i.e. some form of natural (and structured) sensory input such as
sounds or images rather than noise. For some sensory systems it could be shown that there is
strong evidence that early stages of processing indeed perform an optimal coding, see e.g. [77].
While in the beginning mainly the visual system was studied and it was shown that the Infomax 
principle holds here j^, other sensory modalities were also considered in the following years 

[niEglEolEIllloglini]. 

But whereas the Infomax principle could explain certain experimental findings in the early
sensory processing stages, the picture becomes less clear the further downstream the information
processing in neural networks is considered. Here, other principles were also argued for, see for
example [i5] . 

On the system level, Friston et al. [38, 36] proposed an information-theoretic measure of free
energy in the brain that can be understood as a generalization of the concept of efficient coding.
Also arguing for optimal information transfer, Norwich [76] gave a theory of perception based
on information-theoretic principles. He argues that the information present in some stimulus 
is relayed to the brain by the sensory system with negligible loss. Many empirical equations of 
psychophysics can be derived from this model. 

6.4 Scales 

There are many scales at which information-theoretic analyses of neural systems can be performed:
from the level of a single synapse [301 US] over the level of single neurons [221 [S3] and the
population level [571 [311 (SD] [HI [3S] up to the system level [110, 75]. In the former cases the
analyses are usually carried out on electrophysiologically recorded data of single cells, whereas on 
the system level data is usually obtained by EEG, fMRI or MEG measurements. 

Notice that most of the information-theoretic analyses of neural systems were done for early 
stages of sensory systems, focusing on the assessment of the amount of mutual information between 
some stimulus and its neural response. Here different questions can be answered, about the 
nature and efficiency of the neural code and the information conveyed by neural representations of 
stimuli, see [12 [SI [13 IS3] . This stimulus-response-based approach has already provided a lot of 
insight into the processing of information in early sensory systems, but things get more and more 
complicated the more downstream an analysis is performed [22, 93], where the internal dynamics of
the neural system play an increasingly prominent role. As a result, stimulus-response relations are 
less prominent compared to earlier processing stages of the system making information theoretic 
analyses harder to carry out. 

On the systems level, the abilities of neural systems to process and store information are due
to interactions of neurons, populations of neurons and sub-networks. As these interactions are
highly nonlinear and, in contrast to the early sensory systems, neural activity is mainly driven by
the internal network dynamics (see [110, 3]), stimulus-response-type models often are not very
useful here. Transfer entropy has proven to be a valuable tool in this setting, making analyses of
information transfer in the human brain in vivo possible [110, 78]. Transfer entropy can also be
used as a measure for causality, as we will discuss in the next section.

6.5 Causality in the Neurosciences

The idea of causality, namely the question of what are the causes resulting in the observable state
and dynamics of complex systems of physical, biological or social nature, is a deep philosophical
question that has been driving scientists in all fields. In a sense this question lies at
the heart of science itself and as such is often notoriously difficult to answer.

In the neurosciences, this principle is related to one of the core questions of neural coding 
and subsequently neural information processing: What stimuli make neurons spike (or change 
their membrane potential, for non-spiking neurons)? For many years now, neuroscientists have 
investigated neurophysiological correlates of information presented to a sensory system in form of 
stimuli. 

While considerable progress has been made regarding the answer to this question in the early 
stages of sensory processing (see the preceding sections) , where often a clear correlation between 
a stimulus and the resulting neuronal activity could be found, things get less and less clear the 
further downstream this question is addressed. In the latter case, neuronal activity is subject to 
higher and higher degrees of internal dynamics and a clear stimulus-response relation is often 
lacking. 

Considering early sensory systems, even though merely a correlation between a stimulus and 
neural activity can be measured, it is justified to speak of causality here, as it is possible to 
actively influence the stimulus and observe the change in neural activity. Note that the idea of 
intervention is crucial here, see [55117]. 

Looking at more downstream systems or at the cognitive level, an active intervention albeit 
possible (but often not as directly as for sensory systems) may not have the same easy to detect 
effects on system dynamics. Here, often just statistical correlations can be observed and in most 
cases it is very hard if not impossible to show that the principle of causality in its purest form 
holds. Yet, one can still make some statements regarding what one might call "statistical causality" 
in this case, as we will see. 

In an attempt to give a statistical characterization of the notion of causality, the mathematician
Wiener [112] came up with the following probabilistic framing of this concept that came to be
known as Wiener causality: Consider two stochastic processes $X = (X_t)_{t \in \mathbb{N}}$ and $Y = (Y_t)_{t \in \mathbb{N}}$. Then
$Y$ is said to Wiener-cause $X$ if the knowledge of past values of $Y$ diminishes uncertainty about
the future values of $X$. Note that Wiener causality is thus a measure of predictive information
transfer and not one of causality, and thus the naming is a bit unfortunate, see [63].

The economist Granger employed Wiener's principle of causality and developed the notion
of what is nowadays called Wiener-Granger causality [HHTl]. Subsequently, the linear Wiener-Granger
causality and its generalizations were often employed as measure of statistical causality
in the neurosciences, see [46, 16]. Another model for causality in the neurosciences is dynamic
causal modeling [lU 11021 137] .

In contrast to dynamic causal modeling, causality measures based on information-theoretic 
concepts are usually purely data-driven and thus inherently model-free [46, 110]. This fact can
be of advantage in some cases but we do not want to make a judgment here, calling one method 
better per se, as each has its advantages and drawbacks [55] . 

The directional and time-dynamic nature of transfer entropy allows using it as a measure of 
Wiener-causality, as was proposed in the field of neurosciences recently [110] . As such, transfer 
entropy can be seen as a non-linear extension of the concept of Wiener-Granger causality, see [55] 
for a comparison of transfer entropy to other measures.
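For discrete-valued time series, a simple plug-in estimate of transfer entropy can be sketched as follows. The snippet assumes a history length of one time step for both processes and uses the decomposition $TE_{Y \to X} = H(X_{t+1}, X_t) - H(X_t) - H(X_{t+1}, X_t, Y_t) + H(X_t, Y_t)$; the function names and the toy data, in which $X$ copies $Y$ with a delay of one step, are illustrative assumptions. No bias correction is applied.

```python
from collections import Counter

import numpy as np

def entropy_bits(symbols):
    """Plug-in entropy in bits of a sequence of hashable symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def transfer_entropy_bits(source, target):
    """Plug-in transfer entropy source -> target with history length 1."""
    x_next = np.asarray(target)[1:].tolist()
    x_past = np.asarray(target)[:-1].tolist()
    y_past = np.asarray(source)[:-1].tolist()
    return (entropy_bits(zip(x_next, x_past))
            - entropy_bits(x_past)
            - entropy_bits(zip(x_next, x_past, y_past))
            + entropy_bits(zip(x_past, y_past)))

# Toy data: Y is a sequence of fair coin flips, X copies Y one step later.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=5000)
x = np.roll(y, 1)
x[0] = 0  # the first value of X has no predecessor in Y

te_y_to_x = transfer_entropy_bits(y, x)
te_x_to_y = transfer_entropy_bits(x, y)
print(f"TE(Y -> X) = {te_y_to_x:.3f} bits, TE(X -> Y) = {te_x_to_y:.3f} bits")
```

The asymmetry of the two values reflects the directional nature of the measure: the past of $Y$ predicts $X$, but not vice versa.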

Note again that transfer entropy still essentially is a measure of conditional correlation rather 
than one of direct effect (i.e. causality) and that correlation is not causation. Thus it is a 
philosophical question to which extent transfer entropy can be used to infer some form of causality, 
a question that we will not further pursue here, rather pointing the reader to [85] US [Ml [I] ■ 

In any case the statistical significance of the inferred causality (remember that transfer entropy
just measures conditional correlation) has to be verified. For trial-based data sets as often found
in the neurosciences, this testing is usually done against the null hypothesis $H_0$ of average transfer
entropy obtained by random shuffling of the data.
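One way such a shuffling test could look for trial-based data is sketched below. It is an illustration under several assumptions, not the procedure of any particular toolbox: transfer entropy is estimated with a plug-in estimator with history length one, surrogates are created by randomly permuting the trial order of the source, and the p-value is the fraction of surrogates whose trial-averaged transfer entropy reaches the observed value.

```python
from collections import Counter

import numpy as np

def entropy_bits(symbols):
    """Plug-in entropy in bits of a sequence of hashable symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def transfer_entropy_bits(source, target):
    """Plug-in transfer entropy source -> target with history length 1."""
    x_next, x_past = target[1:].tolist(), target[:-1].tolist()
    y_past = source[:-1].tolist()
    return (entropy_bits(zip(x_next, x_past)) - entropy_bits(x_past)
            - entropy_bits(zip(x_next, x_past, y_past))
            + entropy_bits(zip(x_past, y_past)))

def mean_transfer_entropy(sources, targets):
    """Transfer entropy averaged over trials (rows of the arrays)."""
    return float(np.mean([transfer_entropy_bits(s, t)
                          for s, t in zip(sources, targets)]))

def trial_shuffle_test(sources, targets, n_surrogates=200, seed=0):
    """Compare the observed average TE to trial-shuffled surrogates."""
    rng = np.random.default_rng(seed)
    observed = mean_transfer_entropy(sources, targets)
    n_trials = len(sources)
    null = np.array([
        mean_transfer_entropy(sources[rng.permutation(n_trials)], targets)
        for _ in range(n_surrogates)
    ])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_surrogates)
    return observed, float(p_value)

# Toy data: 20 trials of 200 time steps, target copies source with delay 1.
rng = np.random.default_rng(3)
sources = rng.integers(0, 2, size=(20, 200))
targets = np.roll(sources, 1, axis=1)
targets[:, 0] = 0

observed_te, p_value = trial_shuffle_test(sources, targets)
print(f"average TE = {observed_te:.3f} bits, p = {p_value:.4f}")
```

Shuffling whole trials keeps the temporal structure within each time series intact and only destroys the relation between source and target.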

6.6 Information-theoretic Aspects of Neural Dysfunction 

Given the fact that information-theoretic analyses can provide us with insights about the
functioning of neural systems, the next logical step is to ask how this might help us in better
understanding neural dysfunction and neural diseases, maybe even giving hints to new treatments.
Given the demographic change, with a constantly growing group of elderly persons
and, as a result, a growing number of people suffering from age-related neuronal
diseases such as Alzheimer's and Parkinson's, research on possible treatments of such diseases is
of prime importance.

The field one might call "computational neuroscience of disease" is an emerging field of 
research within the neurosciences, see the special issue of Neural Networks [28]. The discipline
faces some hard questions as in many cases dysfunction is observed on the cognitive (i.e. systems-) 
level but has causes on many scales of neural function (sub-cellular, cellular, population, system). 

Over the last years, different theoretical models regarding neural dysfunction and disease were 
proposed, among them computational models applicable to the field of psychiatry [48l [72] , models 
for brain lesions [T] , models of epilepsy [3] , models for deep brain stimulation [7D1 [55] , models for 
aspects of Parkinson's |3S1 [73] and Alzheimer's [llj i56j disease, of abnormal auditory processing 
[211 [55] and for congenital prosopagnosia (a deficit in face identification) |103j . 

Some of these models employ information-theoretic ideas in order to assess differences between 
the healthy and dysfunctional states [103^ i8j. For example, information-theoretic analyses of 
cognitive and systems- level processes in the prefrontal cortex were carried out recently |531 [5] and 
differences in information-processing could be assessed between the healthy and dysfunctional 
system by means of information-theory [5] . 

Yet, computational neuroscience of disease is a very young field of research and it remains 
to be elucidated if and in what way analyses of neural systems employing information-theoretic 
principles could be of help in medicine on a broader scale. 

7 Software 

There exist several open source software packages that can be used to estimate information- 
theoretic quantities of neural data. The list below is by no means complete, but should give a 
good overview of things, see also [i^ . 

• entropy: Entropy and Mutual Information Estimation
URL: http://cran.r-project.org/web/packages/entropy
Authors: Jean Hausser and Korbinian Strimmer
Type: R package 

From the website: This package implements various estimators of entropy, such as the 
shrinkage estimator by Hausser and Strimmer, the maximum likelihood and the Miller-Madow
estimator, various Bayesian estimators, and the Chao-Shen estimator. It also offers
an R interface to the NSB estimator. Furthermore, it provides functions for estimating 
mutual information. 

• information-dynamics-toolkit
URL: http://code.google.com/p/information-dynamics-toolkit
Author: Joseph Lizier
Type: standalone Java software

From the website: Provides a Java implementation of information-theoretic measures
of distributed computation in complex systems: i.e. information storage, transfer and 
modification. Includes implementations for both discrete and continuous-valued variables 
for entropy, entropy rate, mutual information, conditional mutual information, transfer 
entropy, conditional/complete transfer entropy, active information storage, excess entropy / 
predictive information, separable information. 

• ITE (Information Theoretical Estimators)
URL: https://bitbucket.org/szzoli/ite/
Author: Zoltan Szabo
Type: Matlab/Octave plugin

From the website: ITE is capable of estimating many different variants of entropy, mutual 
information and divergence measures. Thanks to its highly modular design, ITE supports 
additionally the combinations of the estimation techniques, the easy construction and 
embedding of novel information theoretical estimators, and their immediate application in 
information theoretical optimization problems. ITE can estimate Shannon-, Renyi entropy; 
generalized variance, kernel canonical correlation analysis, kernel generalized variance, 
Hilbert-Schmidt independence criterion, Shannon-, L2-, Renyi-, Tsallis mutual information,
copula-based kernel dependency, multivariate version of Hoeffding's Phi; complex variants of 
entropy and mutual information; L2-, Renyi-, Tsallis divergence, maximum mean discrepancy, 
and J-distance. ITE offers solution methods for Independent Subspace Analysis (ISA) and 
its extensions to different linear-, controlled-, post nonlinear-, complex valued-, partially 
observed systems, as well as to systems with nonparametric source dynamics. 

• PyEntropy
URL: http://code.google.com/p/pyentropy
Authors: Robin Ince, Rasmus Petersen, Daniel Swan, Stefano Panzeri
Type: Python module

From the website: pyEntropy is a Python module for estimating entropy and information
theoretic quantities using a range of bias correction methods.

• Spike Train Analysis Toolkit
URL: http://neuroanalysis.org/toolkit
Authors: Michael Repucci, David Goldberg, Jonathan Victor, Daniel Gardner
Type: Matlab/Octave plugin

From the website: Information theoretic methods are now widely used for the analysis of
spike train data. However, developing robust implementations of these methods can be
tedious and time-consuming. In order to facilitate further adoption of these methods, we
have developed the Spike Train Analysis Toolkit, a software package which implements
several information-theoretic spike train analysis techniques.

• TRENTOOL
URL: http://trentool.de
Authors: Michael Lindner, Raul Vicente, Michael Wibral, Nicu Pampu and Patricia Wollstadt
Type: Matlab/Octave plugin

From the website: TRENTOOL uses the data format of the open source MATLAB toolbox
Fieldtrip, that is popular for electrophysiology data (EEG/MEG/LFP). Parameters for
delay embedding are automatically obtained from the data. TE values are estimated by
the Kraskov-Stögbauer-Grassberger estimator and subjected to a statistical test against
suitable surrogate data. Experimental effects can then be tested on a second level. Results
can be plotted using Fieldtrip layout formats.